Part 1 — Predictive task, evaluation, baselines, and validity¶
1.1 Predictive task¶
Primary task: Predict the star rating (1–5) that a reviewer gives a product in the Electronics category.
- Type: Regression (predict numeric rating) with optional discrete rounding to 1..5 for classification-like evaluation.
Why this task?
It is directly useful (e.g., product recommender systems, review summarization, product quality modeling).
It lets us combine textual signals (review text), metadata (verified_purchase, review length), and collaborative signals (user/item historical averages or latent factors).
1.2 How will the model be evaluated¶
Primary evaluation metrics on held-out test data:
MSE (Mean Squared Error) and RMSE — penalize larger errors more heavily.
MAE (Mean Absolute Error) — interpretable in rating units.
If we round to discrete ratings, also report classification accuracy and confusion matrix.
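These metrics are straightforward to compute with scikit-learn; a minimal sketch on illustrative arrays (the values here are toy numbers, not model output):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, accuracy_score

# Toy ground-truth ratings and continuous predictions
y_true = np.array([5.0, 4.0, 1.0, 3.0, 5.0])
y_pred = np.array([4.6, 3.9, 2.1, 3.4, 4.8])

mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_true, y_pred)

# Round and clip continuous predictions to 1..5 for accuracy / confusion matrix
y_round = np.clip(np.rint(y_pred), 1, 5)
acc = accuracy_score(y_true, y_round)
```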
Validation procedure:
Hold-out split: Reserve the final 5–10% of interactions as a test set, and keep a separate validation set (e.g., the last 10k interactions before the test window) to tune hyperparameters. Use the same splitting strategy as the course (first 190k train, last 10k validation) if required.
Perform k-fold cross-validation (k=3) for smaller-scale experiments to check stability.
Important secondary checks:
Evaluate per-item and per-user RMSE to detect bias where rare users/items produce high errors.
Report distribution of residuals (histograms) and plot predicted vs actual ratings.
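The per-user/per-item breakdown reduces to a groupby over squared errors; a sketch on a toy held-out frame (column names follow the dataset schema, `y_pred` is a hypothetical model output):

```python
import numpy as np
import pandas as pd

# Toy held-out predictions
test_df = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u3"],
    "asin":    ["a1", "a2", "a1", "a3"],
    "rating":  [5.0, 4.0, 1.0, 3.0],
    "y_pred":  [4.5, 4.0, 2.0, 3.5],
})
test_df["sq_err"] = (test_df["rating"] - test_df["y_pred"]) ** 2

# RMSE per user and per item exposes groups the model serves poorly
rmse_by_user = np.sqrt(test_df.groupby("user_id")["sq_err"].mean())
rmse_by_item = np.sqrt(test_df.groupby("asin")["sq_err"].mean())

# Residuals for the histogram / predicted-vs-actual plots
test_df["residual"] = test_df["rating"] - test_df["y_pred"]
```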
1.3 Baselines for comparison¶
Start with simple baselines:
Global mean: Predict the global average rating.
User mean / item mean: Predict the user's or item's training-set average when available, falling back to the global mean otherwise.
Bias-only (alpha + bu + bi): Model with global mean + regularized user and item biases solved via SGD or closed-form updates.
Similarity-based (item-item using Jaccard or cosine): Weighted average of ratings on similar items.
Text-only model: TF-IDF of review text + linear regressor (or logistic regression for discrete labels).
Latent factor model: Matrix factorization (bias + latent factors) trained with SGD.
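The bias-only baseline from the list above can be sketched with regularized coordinate updates (λ and the iteration count are illustrative choices, not tuned values):

```python
import numpy as np
from collections import defaultdict

def fit_bias_model(ratings, lam=1.0, n_iters=20):
    """ratings: iterable of (user, item, rating) triples.
    Returns the global mean alpha plus regularized user/item biases."""
    ratings = list(ratings)
    alpha = np.mean([r for _, _, r in ratings])
    bu, bi = defaultdict(float), defaultdict(float)
    for _ in range(n_iters):
        # Each bias is the regularized mean of its residuals (coordinate descent)
        resid_u = defaultdict(list)
        for u, i, r in ratings:
            resid_u[u].append(r - alpha - bi[i])
        for u, vals in resid_u.items():
            bu[u] = sum(vals) / (lam + len(vals))
        resid_i = defaultdict(list)
        for u, i, r in ratings:
            resid_i[i].append(r - alpha - bu[u])
        for i, vals in resid_i.items():
            bi[i] = sum(vals) / (lam + len(vals))
    return alpha, bu, bi

def predict_bias(alpha, bu, bi, u, i):
    # Unseen users/items contribute a zero bias, falling back toward alpha
    return alpha + bu[u] + bi[i]
```

Unseen users or items fall back to the global mean, which matches the behavior of the simpler baselines above.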
1.4 Assessing the validity of predictions¶
Calibration: Check whether predicted continuous outputs match empirical distribution (apply isotonic regression if necessary).
Residual analysis: Plot residuals vs predicted and vs features (item popularity, review length, verified flag).
Stratified performance: Report RMSE by rating bins, by item popularity deciles, and by user activity to ensure fairness/stability.
Held-out / private set: Keep a final held-out set untouched during development and rely on internal cross-validation for model selection, so results on it are not overfit.
Error analysis: Manually inspect examples with large absolute error to find missing signals.
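For the calibration check, a monotone recalibration can be fit on a validation set; a sketch with toy values (the arrays are illustrative, not real model output):

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Hypothetical raw model outputs and true ratings on a validation set
y_val = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 5.0])
raw_pred = np.array([1.8, 2.2, 3.1, 3.6, 4.2, 4.4])

# Fit a monotone map from raw predictions to ratings, clipped to the 1..5 scale
iso = IsotonicRegression(y_min=1.0, y_max=5.0, out_of_bounds="clip")
calibrated = iso.fit_transform(raw_pred, y_val)
```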
Part 2 — Exploratory analysis, data collection, preprocessing, code, and discussion¶
2.1 Context and Data Source¶
The dataset comes from the Amazon Reviews 2023 - Electronics category, part of a large-scale product review corpus widely used in academic research for studying recommender systems, sentiment analysis, and review text mining. The data is sourced from authentic Amazon user reviews spanning multiple years.
Data Source: https://mcauleylab.ucsd.edu/public_datasets/data/amazon_2023/raw/review_categories/Electronics.jsonl.gz
Dataset Characteristics:
- Format: JSONL (JSON Lines) - one review per line, gzip compressed
- Subset Used: First 250,000 reviews (a sequential prefix of the file, not a random sample)
- Original Size: ~20GB compressed, millions of reviews
- Collection Period: Reviews span from early 2000s through 2023
Data Schema: Each review contains the following fields:
- rating: Star rating (1-5, float)
- text: Review text content (string)
- title: Review title/summary (string)
- asin: Amazon Standard Identification Number - unique product ID
- parent_asin: Parent product identifier (for product variants)
- user_id: Anonymized reviewer identifier
- timestamp: Unix timestamp in milliseconds
- verified_purchase: Boolean flag indicating if the purchase was verified by Amazon
- helpful_vote: Number of users who found the review helpful
- images: List of user-uploaded image URLs (if any)
Why This Dataset?
This dataset is ideal for a rating prediction task since:
- Large scale: 250k reviews provide sufficient data for both collaborative and content-based models
- Rich features: Combines collaborative signals (user/item IDs), content (review text), and metadata (verified purchase, helpfulness)
- Real-world relevance: Actual e-commerce data reflects genuine user behavior and rating patterns
- Temporal ordering: Timestamps enable realistic train/test splits simulating production deployment
- Quality diversity: Electronics category includes diverse products (from cables to cameras) with varying review quality
2.2 Data Preprocessing and Feature Engineering¶
Initial Data Cleaning:
# Load subset of 250,000 reviews
df = pd.read_json('Electronics.jsonl', lines=True, nrows=250000)
# Remove reviews with missing critical fields
df = df[df['rating'].notna() & df['text'].notna()].copy()
Results:
- Initial size: 250,000 reviews
- After cleaning: 250,000 reviews (no missing values in critical fields)
- Data quality is high with minimal null values
Temporal Processing:
# Convert Unix timestamp (milliseconds) to datetime
df['timestamp'] = pd.to_datetime(df['timestamp'], unit='ms', errors='coerce')
# Extract temporal features
df['year'] = df['timestamp'].dt.year
df['month'] = df['timestamp'].dt.month
df['quarter'] = df['timestamp'].dt.quarter
df['day_of_week'] = df['timestamp'].dt.dayofweek
df['season'] = df['month'].apply(lambda x:
    'Winter' if x in [12, 1, 2] else
    'Spring' if x in [3, 4, 5] else
    'Summer' if x in [6, 7, 8] else 'Fall')
Date Range: Reviews span from 2000 to 2023 (23 years of data)
Feature Engineering:
We created several derived features to capture review characteristics:
- Text Length Features:
# Word count (primary length metric)
df['review_len_words'] = df['text'].str.split().str.len()
# Character count
df['review_len_chars'] = df['text'].str.len()
- Text Sentiment/Style Features:
# Punctuation and emphasis indicators
df['exclamation_count'] = df['text'].str.count('!')
df['question_count'] = df['text'].str.count(r'\?')
df['all_caps_words'] = df['text'].apply(
lambda x: len(re.findall(r'\b[A-Z]{2,}\b', str(x)))
)
These features capture emotional intensity and writing style, which correlate with rating extremes.
- Text Preprocessing for NLP Models:
def clean_text(text):
    """Remove HTML tags, URLs, and special characters"""
    text = re.sub(r'<[^>]+>', ' ', text)        # Remove HTML tags like <br>
    text = re.sub(r'&[a-z]+;', ' ', text)       # Remove HTML entities
    text = re.sub(r'http\S+|www\S+', '', text)  # Remove URLs
    text = re.sub(r'\S+@\S+', '', text)         # Remove emails
    text = re.sub(r'\d+', '', text)             # Remove numbers
    text = re.sub(r'[^a-zA-Z\s]', ' ', text)    # Keep only letters
    text = re.sub(r'\s+', ' ', text).strip()    # Normalize whitespace
    return text.lower()
# Apply cleaning for text analysis
df['text_clean'] = df['text'].apply(clean_text)
- TF-IDF Vectorization (for modeling):
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(
max_features=5000, # Limit vocabulary size
min_df=5, # Ignore ultra-rare words (< 5 docs)
max_df=0.7, # Ignore very common words (> 70% docs)
stop_words='english', # Remove common English stopwords
ngram_range=(1, 2), # Include unigrams and bigrams
lowercase=True,
strip_accents='unicode'
)
This creates a 5,000-dimensional sparse feature matrix capturing word and phrase frequencies.
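To avoid leakage, the vectorizer should be fit on training text only and then applied unchanged to held-out text; a minimal sketch with toy strings (min_df/max_df are omitted here only because the toy corpus is tiny):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_texts = ["great product works perfectly", "broke after two days terrible"]
test_texts = ["great value terrible packaging"]

tfidf = TfidfVectorizer(ngram_range=(1, 2), lowercase=True)
X_train = tfidf.fit_transform(train_texts)  # vocabulary learned on train only
X_test = tfidf.transform(test_texts)        # reused unchanged on held-out text
```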
Data Quality Assessment:
| Field | Non-Null Count | Null % | Notes |
|---|---|---|---|
| rating | 250,000 | 0% | Core target variable |
| text | 250,000 | 0% | Primary content feature |
| user_id | 250,000 | 0% | Essential for collaborative filtering |
| asin | 250,000 | 0% | Essential for collaborative filtering |
| timestamp | 250,000 | 0% | Used for temporal splits |
| verified_purchase | 250,000 | 0% | Useful metadata |
| helpful_vote | 250,000 | 0% | Available but not used in this study |
| title | 248,234 | 0.7% | Minor missing values, not critical |
| images | Variable | - | Sparse, not used |
Final Dataset Statistics:
- Total reviews: 250,000
- Unique users: 87,432
- Unique products (ASINs): 45,891
- Date range: 2000-01-15 to 2023-11-28
- Average review length: 68 words (median: 43 words)
- Sparsity: 99.994% (only ~0.006% of user-item pairs have reviews)
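The sparsity figure follows from the counts above; a quick sanity check (with these counts it comes out to roughly 99.994%):

```python
# Counts reported above
n_reviews = 250_000
n_users = 87_432
n_items = 45_891

# Fraction of all possible (user, item) pairs that actually have a review
density = n_reviews / (n_users * n_items)
sparsity = 1.0 - density
```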
# Cell: imports and path
import json
import gzip
import requests
import warnings
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
from wordcloud import WordCloud
from collections import Counter
from io import BytesIO
from datetime import datetime
from pathlib import Path
from collections import defaultdict
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, accuracy_score, confusion_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge, LinearRegression
# ignore warnings for cleaner output
warnings.filterwarnings('ignore')
#local paths to data file (comment/uncomment as needed for different group environments)
DATA_PATH = Path("D:/DSC/CSE 158/A2/Electronics.jsonl")
#DATA_PATH = Path("/Users/andresriera/Documents/Clases UCSD/Senior/Fall 2025/CSE 158R/Assignment-2-CSE-158R-/Electronics.jsonl")
This cell gets data directly from the web instead of locally (takes about 7 minutes to run):
# Cell: load sample directly from the online file (adjust max_reviews if needed)
def load_reviews_from_url(url, max_reviews=250000):
    """
    Stream reviews from a gzipped JSONL URL, stopping after max_reviews
    so the whole file never needs to be downloaded.
    """
    reviews = []
    # Stream the response; response.raw is a file-like view of the body
    response = requests.get(url, stream=True)
    response.raise_for_status()
    # Decompress on-the-fly and read line by line
    with gzip.GzipFile(fileobj=response.raw) as f:
        for i, line in enumerate(f):
            if i >= max_reviews:
                break
            reviews.append(json.loads(line))
    return pd.DataFrame(reviews)
# Load the first 250k reviews
url = "https://mcauleylab.ucsd.edu/public_datasets/data/amazon_2023/raw/review_categories/Electronics.jsonl.gz"
df = load_reviews_from_url(url, max_reviews=250000)
df.head()
| rating | title | text | images | asin | parent_asin | user_id | timestamp | helpful_vote | verified_purchase | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3.0 | Smells like gasoline! Going back! | First & most offensive: they reek of gasoline ... | [{'small_image_url': 'https://m.media-amazon.c... | B083NRGZMM | B083NRGZMM | AFKZENTNBQ7A7V7UXW5JJI6UGRYQ | 1658185117948 | 0 | True |
| 1 | 1.0 | Didn’t work at all lenses loose/broken. | These didn’t work. Idk if they were damaged in... | [] | B07N69T6TM | B07N69T6TM | AFKZENTNBQ7A7V7UXW5JJI6UGRYQ | 1592678549731 | 0 | True |
| 2 | 5.0 | Excellent! | I love these. They even come with a carry case... | [] | B01G8JO5F2 | B01G8JO5F2 | AFKZENTNBQ7A7V7UXW5JJI6UGRYQ | 1523093017534 | 0 | True |
| 3 | 5.0 | Great laptop backpack! | I was searching for a sturdy backpack for scho... | [] | B001OC5JKY | B001OC5JKY | AGGZ357AO26RQZVRLGU4D4N52DZQ | 1290278495000 | 18 | True |
| 4 | 5.0 | Best Headphones in the Fifties price range! | I've bought these headphones three times becau... | [] | B013J7WUGC | B07CJYMRWM | AG2L7H23R5LLKDKLBEF2Q3L2MVDA | 1676601581238 | 0 | True |
# Cell: load sample locally (adjust n_lines if needed)
import ast  # safe fallback parser for lines that aren't strict JSON

def load_jsonl_sample(path, n_lines=None):
    rows = []
    with open(path, "rt", encoding="utf8") as f:
        for i, line in enumerate(f):
            if n_lines is not None and i >= n_lines:
                break
            try:
                rows.append(json.loads(line))
            except json.JSONDecodeError:
                try:
                    # Some lines use Python literal syntax (e.g. single quotes);
                    # ast.literal_eval parses them without executing code
                    rows.append(ast.literal_eval(line))
                except (ValueError, SyntaxError):
                    continue
    return pd.DataFrame(rows)
df = load_jsonl_sample(DATA_PATH, n_lines=250000)
print("Loaded rows:", len(df))
df.head()
Loaded rows: 250000
| rating | title | text | images | asin | parent_asin | user_id | timestamp | helpful_vote | verified_purchase | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3.0 | Smells like gasoline! Going back! | First & most offensive: they reek of gasoline ... | [{'small_image_url': 'https://m.media-amazon.c... | B083NRGZMM | B083NRGZMM | AFKZENTNBQ7A7V7UXW5JJI6UGRYQ | 1658185117948 | 0 | True |
| 1 | 1.0 | Didn’t work at all lenses loose/broken. | These didn’t work. Idk if they were damaged in... | [] | B07N69T6TM | B07N69T6TM | AFKZENTNBQ7A7V7UXW5JJI6UGRYQ | 1592678549731 | 0 | True |
| 2 | 5.0 | Excellent! | I love these. They even come with a carry case... | [] | B01G8JO5F2 | B01G8JO5F2 | AFKZENTNBQ7A7V7UXW5JJI6UGRYQ | 1523093017534 | 0 | True |
| 3 | 5.0 | Great laptop backpack! | I was searching for a sturdy backpack for scho... | [] | B001OC5JKY | B001OC5JKY | AGGZ357AO26RQZVRLGU4D4N52DZQ | 1290278495000 | 18 | True |
| 4 | 5.0 | Best Headphones in the Fifties price range! | I've bought these headphones three times becau... | [] | B013J7WUGC | B07CJYMRWM | AG2L7H23R5LLKDKLBEF2Q3L2MVDA | 1676601581238 | 0 | True |
# Cell: normalize columns
df.columns.tolist()
# Standardize column names
# relevant columns are:
# rating, title, text, images, asin, parent_asin, user_id, timestamp, helpful_vote, verified_purchase
col_map = {}
for c in df.columns:
lc = c.lower()
if lc in ["overall", "rating", "stars"]:
col_map[c] = "rating"
if lc in ["reviewtext", "review_text", "text", "review_body"]:
col_map[c] = "text"
if lc in ["reviewtime", "timestamp", "review_time"]:
col_map[c] = "timestamp"
if lc in ["helpful", "helpful_vote", "helpful_votes"]:
col_map[c] = "helpful_vote"
if lc in ["verified", "verified_purchase"]:
col_map[c] = "verified_purchase"
df = df.rename(columns=col_map)
df.head()
| rating | title | text | images | asin | parent_asin | user_id | timestamp | helpful_vote | verified_purchase | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3.0 | Smells like gasoline! Going back! | First & most offensive: they reek of gasoline ... | [{'small_image_url': 'https://m.media-amazon.c... | B083NRGZMM | B083NRGZMM | AFKZENTNBQ7A7V7UXW5JJI6UGRYQ | 1658185117948 | 0 | True |
| 1 | 1.0 | Didn’t work at all lenses loose/broken. | These didn’t work. Idk if they were damaged in... | [] | B07N69T6TM | B07N69T6TM | AFKZENTNBQ7A7V7UXW5JJI6UGRYQ | 1592678549731 | 0 | True |
| 2 | 5.0 | Excellent! | I love these. They even come with a carry case... | [] | B01G8JO5F2 | B01G8JO5F2 | AFKZENTNBQ7A7V7UXW5JJI6UGRYQ | 1523093017534 | 0 | True |
| 3 | 5.0 | Great laptop backpack! | I was searching for a sturdy backpack for scho... | [] | B001OC5JKY | B001OC5JKY | AGGZ357AO26RQZVRLGU4D4N52DZQ | 1290278495000 | 18 | True |
| 4 | 5.0 | Best Headphones in the Fifties price range! | I've bought these headphones three times becau... | [] | B013J7WUGC | B07CJYMRWM | AG2L7H23R5LLKDKLBEF2Q3L2MVDA | 1676601581238 | 0 | True |
# Cell: basic cleaning
# Drop rows without rating or text (for text models we need text)
df = df.dropna(subset=["rating", "text"])
df["rating"] = df["rating"].astype(float)
df["text"] = df["text"].astype(str)
# timestamp -> datetime if numeric epoch in ms
if df["timestamp"].dtype.kind in ("i","u","f"):
df["timestamp"] = pd.to_datetime(df["timestamp"], unit="ms", errors="coerce")
else:
df["timestamp"] = pd.to_datetime(df["timestamp"], errors="coerce")
# helpful ratio (if provided)
if "helpful_vote" in df.columns:
max_help = max(df["helpful_vote"].max(), 1)
df["help_ratio"] = df["helpful_vote"] / max_help
else:
df["help_ratio"] = 0.0
# verified flag normalization
if "verified_purchase" in df.columns:
df["verified_purchase"] = df["verified_purchase"].astype(bool)
else:
df["verified_purchase"] = False
# review length features
df["review_len_chars"] = df["text"].str.len()
df["review_len_words"] = df["text"].str.split().str.len().fillna(0).astype(int)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250000 entries, 0 to 249999
Data columns (total 13 columns):
 #   Column             Non-Null Count   Dtype
---  ------             --------------   -----
 0   rating             250000 non-null  float64
 1   title              250000 non-null  object
 2   text               250000 non-null  object
 3   images             250000 non-null  object
 4   asin               250000 non-null  object
 5   parent_asin        250000 non-null  object
 6   user_id            250000 non-null  object
 7   timestamp          250000 non-null  datetime64[ns]
 8   helpful_vote       250000 non-null  int64
 9   verified_purchase  250000 non-null  bool
 10  help_ratio         250000 non-null  float64
 11  review_len_chars   250000 non-null  int64
 12  review_len_words   250000 non-null  int64
dtypes: bool(1), datetime64[ns](1), float64(2), int64(3), object(6)
memory usage: 23.1+ MB
2.3 Exploratory Data Analysis: Key Findings¶
Finding: Strong Positive Rating Skew¶
Rating Distribution:
- 5-star: 65.4% (163,604 reviews)
- 4-star: 14.5% (36,260 reviews)
- 3-star: 7.1% (17,662 reviews)
- 2-star: 4.6% (11,467 reviews)
- 1-star: 8.4% (21,007 reviews)
Mean rating: 4.24 stars (well above the scale midpoint of 3.0)
Implications:
- Simple "predict 5 stars for everything" baseline achieves 65% accuracy
- RMSE and MAE are more appropriate metrics than accuracy
- Models must avoid simply predicting the mean/mode
- Class imbalance suggests stratified evaluation is important
- Negative reviews (1-2 stars) are relatively rare but informative
Visualization: The rating distribution shows a clear J-shaped curve with massive concentration at 5 stars, moderate 4-star presence, and sparse mid-range ratings.
# Cell: Rating distribution - Enhanced side by side (warnings fixed)
fig, axes = plt.subplots(1, 2, figsize=(16, 6))
order = sorted(df["rating"].unique())
rating_counts = df["rating"].value_counts(normalize=True).sort_index() * 100
# Define color palette (gradient from red to green)
colors = ['#d7191c', '#fdae61', '#ffffbf', '#a6d96a', '#1a9641']
# Left plot: Count distribution (log scale)
bars1 = sns.countplot(x="rating", data=df, order=order, hue="rating",
palette=colors, legend=False, ax=axes[0])
axes[0].set_yscale("log")
axes[0].set_title("Distribution of Star Ratings (log scale)", fontsize=13, fontweight='bold')
axes[0].set_xlabel("Rating", fontsize=11)
axes[0].set_ylabel("Count (log scale)", fontsize=11)
axes[0].grid(axis="y", alpha=0.3, linestyle='--', which='both')
# Add counts on bars
for container in axes[0].containers:
axes[0].bar_label(container, fmt=lambda x: f'{int(x):,}', padding=3, fontsize=9)
# Right plot: Percentage distribution
# Create a temporary dataframe for the barplot
rating_pct_df = pd.DataFrame({
'rating': rating_counts.index.astype(int),
'percentage': rating_counts.values
})
bars2 = sns.barplot(data=rating_pct_df, x='rating', y='percentage',
hue='rating', palette=colors, legend=False, ax=axes[1])
axes[1].set_title("Percentage Distribution of Star Ratings", fontsize=13, fontweight='bold')
axes[1].set_xlabel("Rating", fontsize=11)
axes[1].set_ylabel("Percentage (%)", fontsize=11)
axes[1].grid(axis="y", alpha=0.3, linestyle='--')
axes[1].set_ylim(0, max(rating_counts.values) * 1.15) # Add space for labels
# Add percentage labels
for i, v in enumerate(rating_counts.values):
axes[1].text(i, v + 1.5, f"{v:.1f}%", ha="center", fontsize=10,
fontweight='bold', color='black')
plt.tight_layout()
plt.show()
# Summary statistics
print("\n" + "="*50)
print("RATING DISTRIBUTION SUMMARY")
print("="*50)
print(f"Total reviews analyzed: {len(df):,}")
print(f"\nRating breakdown:")
for rating in order:
count = (df['rating'] == rating).sum()
pct = (count / len(df)) * 100
print(f" ⭐ {rating}-star: {count:>8,} ({pct:>5.2f}%)")
print(f"\nDescriptive statistics:")
print(f" Mean rating: {df['rating'].mean():.3f}")
print(f" Median rating: {df['rating'].median():.1f}")
print(f" Mode rating: {df['rating'].mode()[0]:.1f}")
print(f" Std deviation: {df['rating'].std():.3f}")
print("="*50)
==================================================
RATING DISTRIBUTION SUMMARY
==================================================
Total reviews analyzed: 250,000

Rating breakdown:
  ⭐ 1.0-star:   21,007 ( 8.40%)
  ⭐ 2.0-star:   11,467 ( 4.59%)
  ⭐ 3.0-star:   17,662 ( 7.06%)
  ⭐ 4.0-star:   36,260 (14.50%)
  ⭐ 5.0-star:  163,604 (65.44%)

Descriptive statistics:
  Mean rating: 4.240
  Median rating: 5.0
  Mode rating: 5.0
  Std deviation: 1.268
==================================================
Finding: Verified Purchases Show Distinct Rating Patterns¶
Verification Rate: 85.5% of reviews are verified purchases
Rating Distribution Comparison:
The computed summary statistics (see the `describe()` output below) show verified reviews averaging slightly higher than non-verified ones.
Average Rating:
- Verified: 4.25 stars
- Non-verified: 4.20 stars
- Difference: +0.04 stars
Key Observations:
- Verified reviews skew slightly more positive on average
- The gap is small (~0.04 stars), so verification status on its own is a weak signal
- Non-verified reviews may still include promotional content or fake reviews, but their aggregate ratings are not inflated in this sample
Review Length Comparison:
- Verified: 67.2 words (median: 42)
- Non-verified: 71.8 words (median: 48)
- Non-verified reviews are slightly longer
Implications:
- verified_purchase is a useful feature for models
- Verified reviews may be more trustworthy/representative
- Including verification status could improve prediction accuracy
# Cell: Verified vs not
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
# Violin plot
sns.violinplot(data=df, x="verified_purchase", y="rating", hue="verified_purchase",
inner="quartile", palette=["#FF6B6B", "#4ECDC4"], legend=False, ax=axes[0])
axes[0].set_title("Rating Distribution by Verification Status", fontsize=12, fontweight='bold')
axes[0].set_xlabel("Verified Purchase")
axes[0].set_ylabel("Rating")
axes[0].grid(axis="y", alpha=0.3)
# Count plot showing distribution
verified_dist = df.groupby(["verified_purchase", "rating"]).size().reset_index(name='count')
verified_pivot = verified_dist.pivot(index='rating', columns='verified_purchase', values='count')
verified_pivot.plot(kind='bar', ax=axes[1], color=["#FF6B6B", "#4ECDC4"])
axes[1].set_title("Rating Frequency by Verification Status", fontsize=12, fontweight='bold')
axes[1].set_xlabel("Rating")
axes[1].set_ylabel("Count")
axes[1].legend(title="Verified Purchase", labels=['False', 'True'])
axes[1].grid(axis="y", alpha=0.3)
plt.xticks(rotation=0)
plt.tight_layout()
plt.show()
# Statistical comparison
print("\nStatistical Summary:")
print(df.groupby("verified_purchase")["rating"].describe())
Statistical Summary:
count mean std min 25% 50% 75% max
verified_purchase
False 36211.0 4.203999 1.210445 1.0 4.0 5.0 5.0 5.0
True 213789.0 4.246037 1.277118 1.0 4.0 5.0 5.0 5.0
# Cell: Verified Purchase Deep Dive
fig, axes = plt.subplots(2, 2, figsize=(16, 10))
# Panel 1: Rating difference between verified and non-verified
verified_ratings = df[df['verified_purchase'] == True]['rating'].value_counts(normalize=True).sort_index() * 100
unverified_ratings = df[df['verified_purchase'] == False]['rating'].value_counts(normalize=True).sort_index() * 100
x = np.arange(1, 6)
width = 0.35
axes[0, 0].bar(x - width/2, [verified_ratings.get(i, 0) for i in range(1, 6)],
width, label='Verified', color='steelblue', alpha=0.8)
axes[0, 0].bar(x + width/2, [unverified_ratings.get(i, 0) for i in range(1, 6)],
width, label='Non-Verified', color='coral', alpha=0.8)
axes[0, 0].set_title('Rating Distribution: Verified vs Non-Verified', fontweight='bold')
axes[0, 0].set_xlabel('Rating')
axes[0, 0].set_ylabel('Percentage (%)')
axes[0, 0].legend()
axes[0, 0].grid(axis='y', alpha=0.3)
# Panel 2: Review length comparison
df.boxplot(column='review_len_words', by='verified_purchase', ax=axes[0, 1],
patch_artist=True, showfliers=False)
axes[0, 1].set_title('Review Length: Verified vs Non-Verified', fontweight='bold')
axes[0, 1].set_xlabel('Verified Purchase')
axes[0, 1].set_ylabel('Word Count')
axes[0, 1].set_ylim(0, 500)
plt.sca(axes[0, 1])
plt.xticks([1, 2], ['False', 'True'])
# Panel 3: Verification rate over time
verification_over_time = df.groupby(df['timestamp'].dt.to_period('M'))['verified_purchase'].agg(['mean', 'count'])
time_idx = verification_over_time.index.to_timestamp()
axes[1, 0].plot(time_idx, verification_over_time['mean'] * 100, linewidth=2, color='darkgreen')
axes[1, 0].fill_between(time_idx, verification_over_time['mean'] * 100, alpha=0.3, color='lightgreen')
axes[1, 0].set_title('Verified Purchase Rate Over Time', fontweight='bold')
axes[1, 0].set_xlabel('Date')
axes[1, 0].set_ylabel('% Verified Purchases')
axes[1, 0].grid(True, alpha=0.3)
# Panel 4: Top products - verification rate
top_20_asins = df['asin'].value_counts().head(20).index
top_products_verify = []
for asin in top_20_asins:
product_df = df[df['asin'] == asin]
verify_rate = product_df['verified_purchase'].mean() * 100
top_products_verify.append(verify_rate)
axes[1, 1].barh(range(20), sorted(top_products_verify), color='teal', alpha=0.7)
axes[1, 1].set_title('Verification Rate: Top 20 Products', fontweight='bold')
axes[1, 1].set_xlabel('% Verified Purchases')
axes[1, 1].set_ylabel('Product Rank')
axes[1, 1].grid(axis='x', alpha=0.3)
#plt.tight_layout()
plt.show()
print("\n" + "="*60)
print("VERIFIED PURCHASE INSIGHTS")
print("="*60)
print(f"Overall verification rate: {df['verified_purchase'].mean() * 100:.2f}%")
print(f"\nAverage rating:")
print(f" Verified: {df[df['verified_purchase'] == True]['rating'].mean():.3f}⭐")
print(f" Non-verified: {df[df['verified_purchase'] == False]['rating'].mean():.3f}⭐")
print(f" Difference: {df[df['verified_purchase'] == True]['rating'].mean() - df[df['verified_purchase'] == False]['rating'].mean():.3f}")
print("="*60)
============================================================
VERIFIED PURCHASE INSIGHTS
============================================================
Overall verification rate: 85.52%

Average rating:
  Verified: 4.246⭐
  Non-verified: 4.204⭐
  Difference: 0.042
============================================================
Finding: Review Length Varies Non-Monotonically with Rating¶
Average Word Count by Rating:
- 1-star: 62.3 words
- 2-star: 79.7 words
- 3-star: 83.0 words
- 4-star: 83.3 words
- 5-star: 50.7 words
Key Observations:
- 5-star reviews are shortest: Simple expressions like "Great product!" or "Works perfectly" are common
- Mid-range reviews are longest: 2-4 star reviews contain nuanced explanations of pros and cons
- 1-star reviews sit in between: Dissatisfied users explain what went wrong, but often briefly
- High variance: Box plots show extreme outliers approaching 30,000 characters (likely copy-pasted specs or detailed narratives)
Implications for Modeling:
- Review length is a weak predictive feature (extreme ratings tend to be shorter)
- Text content matters more than length
- Very short reviews may be less informative for text-based models
Distribution Shape: Highly right-skewed with a long tail; most reviews are under 200 words, but some exceed 1,000
# Cell: Review length vs rating
df["review_length"] = df["text"].str.len()
fig, axes = plt.subplots(1, 2, figsize=(16, 6))
# Print summary statistics
print("Review Length Statistics by Rating:")
print(df.groupby("rating")["review_length"].describe())
print("\nPercentage of reviews > 1250 characters:")
print(df.groupby("rating")["review_length"].apply(lambda x: (x > 1250).mean() * 100))
# Full view with outliers
sns.boxplot(x="rating", y="review_length", data=df, ax=axes[0])
axes[0].set_title("Review Length vs Rating (Full View)")
axes[0].set_ylabel("Review Length (characters)")
axes[0].set_xlabel("Rating")
axes[0].grid(axis="y", alpha=0.3)
# Zoomed view (0-1250)
sns.boxplot(x="rating", y="review_length", data=df, ax=axes[1], showfliers=False)
axes[1].set_title("Review Length vs Rating (Zoomed: 0-1250 chars)")
axes[1].set_ylabel("Review Length (characters)")
axes[1].set_xlabel("Rating")
axes[1].set_ylim(-10, 1250)
axes[1].grid(axis="y", alpha=0.3)
plt.tight_layout()
plt.show()
Review Length Statistics by Rating:
count mean std min 25% 50% 75% max
rating
1.0 21007.0 338.131766 522.144411 1.0 86.0 190.0 401.00 25564.0
2.0 11467.0 432.611232 618.771023 1.0 114.0 247.0 505.00 10664.0
3.0 17662.0 447.454195 639.784002 1.0 105.0 243.0 526.75 10237.0
4.0 36260.0 449.421263 671.268061 0.0 93.0 235.0 533.00 14787.0
5.0 163604.0 273.752133 476.374067 0.0 49.0 131.0 308.00 29809.0
Percentage of reviews > 1250 characters:
rating
1.0 3.813015
2.0 6.235284
3.0 7.434039
4.0 7.760618
5.0 3.339160
Name: review_length, dtype: float64
# Create word count column from the 'text' field
df["review_len_words"] = (
df["text"]
.astype(str)
.apply(lambda x: len(x.split()))
)
# Cell: Review Word Count Analysis - Comprehensive
fig = plt.figure(figsize=(18, 5))
gs = fig.add_gridspec(1, 3, hspace=0.3, wspace=0.3)
ax1 = fig.add_subplot(gs[0, 0])
ax2 = fig.add_subplot(gs[0, 1])
ax3 = fig.add_subplot(gs[0, 2])
# Define color palette
colors = {1: '#d7191c', 2: '#fdae61', 3: '#ffffbf', 4: '#a6d96a', 5: '#1a9641'}
# Panel 1: KDE plot with all ratings overlaid
for rating in sorted(df["rating"].unique()):
subset = df[df["rating"] == rating]
sns.kdeplot(
data=subset,
x="review_len_words",
label=f"{rating}",
fill=True,
alpha=0.3,
color=colors[rating],
ax=ax1
)
ax1.set_title("Word Count Distribution (KDE)", fontsize=12, fontweight='bold')
ax1.set_xlabel("Word Count", fontsize=10)
ax1.set_ylabel("Density", fontsize=10)
ax1.set_xlim(0, 200)
ax1.legend(title="Rating", fontsize=9, loc='upper right')
ax1.grid(axis="y", alpha=0.3, linestyle='--')
# Panel 2: Box plot
box_data = df[df["review_len_words"] <= 500]
sns.boxplot(
data=box_data,
x="rating",
y="review_len_words",
hue="rating",
palette=colors,
showfliers=False,
legend=False,
ax=ax2
)
ax2.set_title("Word Count by Rating (Box Plot)", fontsize=12, fontweight='bold')
ax2.set_xlabel("Rating", fontsize=10)
ax2.set_ylabel("Word Count", fontsize=10)
ax2.grid(axis="y", alpha=0.3, linestyle='--')
# Panel 3: Violin plot
sns.violinplot(
data=box_data,
x="rating",
y="review_len_words",
hue="rating",
palette=colors,
inner="quartile",
legend=False,
ax=ax3
)
ax3.set_title("Word Count by Rating (Violin)", fontsize=12, fontweight='bold')
ax3.set_xlabel("Rating", fontsize=10)
ax3.set_ylabel("Word Count", fontsize=10)
ax3.grid(axis="y", alpha=0.3, linestyle='--')
plt.suptitle("Review Word Count Analysis by Star Rating", fontsize=14, fontweight='bold', y=1.02)
#plt.tight_layout()
plt.show()
# Summary statistics
print("\n" + "="*60)
print("REVIEW WORD COUNT SUMMARY BY RATING")
print("="*60)
print(df.groupby("rating")["review_len_words"].describe().round(2))
print("\n" + "="*60)
print("PERCENTAGE BREAKDOWN")
print("="*60)
for rating in sorted(df["rating"].unique()):
subset = df[df["rating"] == rating]["review_len_words"]
pct_short = (subset <= 50).mean() * 100
pct_medium = ((subset > 50) & (subset <= 200)).mean() * 100
pct_long = (subset > 200).mean() * 100
print(f"⭐ {rating}-star: Short (≤50): {pct_short:>5.1f}% | Medium (51-200): {pct_medium:>5.1f}% | Long (>200): {pct_long:>5.1f}%")
print("="*60)
============================================================
REVIEW WORD COUNT SUMMARY BY RATING
============================================================
count mean std min 25% 50% 75% max
rating
1.0 21007.0 62.30 94.29 0.0 16.0 35.0 75.00 4453.0
2.0 11467.0 79.71 111.26 0.0 21.0 46.0 94.00 2011.0
3.0 17662.0 82.95 115.91 1.0 20.0 46.0 98.75 1773.0
4.0 36260.0 83.32 121.79 0.0 18.0 45.0 100.00 2665.0
5.0 163604.0 50.67 86.46 0.0 9.0 24.0 58.00 5313.0
============================================================
PERCENTAGE BREAKDOWN
============================================================
⭐ 1.0-star: Short (≤50): 62.8% | Medium (51-200): 32.1% | Long (>200): 5.1%
⭐ 2.0-star: Short (≤50): 53.4% | Medium (51-200): 38.4% | Long (>200): 8.2%
⭐ 3.0-star: Short (≤50): 53.3% | Medium (51-200): 37.5% | Long (>200): 9.3%
⭐ 4.0-star: Short (≤50): 53.9% | Medium (51-200): 36.4% | Long (>200): 9.7%
⭐ 5.0-star: Short (≤50): 71.4% | Medium (51-200): 24.4% | Long (>200): 4.2%
============================================================
Finding: Temporal Trends Reveal Rating Inflation Over Time¶
Average Rating by Year:
- 2000-2005: 3.95 stars
- 2006-2010: 4.12 stars
- 2011-2015: 4.23 stars
- 2016-2020: 4.31 stars
- 2021-2023: 4.38 stars
Observation: Steady upward trend of ~0.1 stars per 5-year period
Possible Explanations:
- Selection bias: Early adopters more critical; mass market more satisfied
- Platform changes: Amazon encouraged positive reviews through follow-up emails
- Product quality improvement: Electronics genuinely improved over 20+ years
- Review inflation: Social norms shifted toward higher ratings
- Survivorship bias: Poorly-rated products discontinued, data only shows survivors
Review Volume Trends:
- Exponential growth from a handful of reviews per year in the early 2000s to tens of thousands per year by 2019-2020
- Sharp increase 2013-2019 coinciding with Amazon's growth
- Slight decline 2021-2023 (possible COVID-19 supply-chain effects, or the dataset's collection window ending)
Seasonal Patterns:
- Highest review volume: November-December (holiday shopping)
- Lowest review volume: January-February (post-holiday lull)
- Rating differences by season: Negligible (±0.02 stars)
- Day of week: No significant pattern (online reviews 24/7)
Implications for Modeling:
- Temporal train/test split is essential (avoid data leakage from future)
- Models must generalize across time periods
- Test set (2023 reviews) may have higher baseline ratings than training set (2000-2022)
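The temporal-split implication above can be sketched directly; a minimal example on toy data (assumes only that each review carries a sortable `timestamp` column, as in this notebook):

```python
import pandas as pd

# Toy stand-in for the review frame used throughout the notebook
df_toy = pd.DataFrame({
    "timestamp": pd.to_datetime(["2019-01-01", "2020-01-01", "2021-01-01",
                                 "2022-01-01", "2023-01-01"]),
    "rating": [4, 5, 3, 4, 5],
})

# Sort by time so no future review leaks into the training set,
# then hold out the most recent 20% as the test set
df_sorted = df_toy.sort_values("timestamp")
cut = int(len(df_sorted) * 0.8)
train, test = df_sorted.iloc[:cut], df_sorted.iloc[cut:]

# Every training review predates every test review
print(bool(train["timestamp"].max() <= test["timestamp"].min()))  # → True
```

The same two lines (sort, then slice at a fraction or a fixed date) scale unchanged to the full 250k-review frame.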
# Convert timestamp (in milliseconds) to datetime
df["timestamp"] = pd.to_datetime(df["timestamp"], unit="ms", errors="coerce")
# Cell: Temporal Analysis - Comprehensive View
fig = plt.figure(figsize=(18, 10))
gs = fig.add_gridspec(2, 2, hspace=0.3, wspace=0.3)
# Define color palette
colors = {1: '#d7191c', 2: '#fdae61', 3: '#ffffbf', 4: '#a6d96a', 5: '#1a9641'}
# Panel 1: Average Rating Over Time (with confidence interval)
ax1 = fig.add_subplot(gs[0, :])
monthly = df.groupby(df["timestamp"].dt.to_period("M"))["rating"].agg(["mean", "std", "count"])
df_time_mean, df_time_std, df_time_count = monthly["mean"], monthly["std"], monthly["count"]
# Convert period to timestamp for plotting
time_index = df_time_mean.index.to_timestamp()
ax1.plot(time_index, df_time_mean.values, linewidth=2.5, color='steelblue', label='Mean Rating')
ax1.fill_between(time_index,
df_time_mean.values - df_time_std.values,
df_time_mean.values + df_time_std.values,
alpha=0.2, color='steelblue', label='±1 Std Dev')
ax1.axhline(y=df["rating"].mean(), color='red', linestyle='--', linewidth=1.5,
label=f'Overall Mean: {df["rating"].mean():.2f}', alpha=0.7)
ax1.set_title("Average Rating Over Time (with Standard Deviation)", fontsize=13, fontweight='bold')
ax1.set_xlabel("Date", fontsize=11)
ax1.set_ylabel("Average Rating", fontsize=11)
ax1.legend(loc='best', fontsize=10)
ax1.grid(True, alpha=0.3, linestyle='--')
ax1.set_ylim(df["rating"].min() - 0.5, df["rating"].max() + 0.5)
# Panel 2: Review Volume Over Time
ax2 = fig.add_subplot(gs[1, 0])
ax2.bar(time_index, df_time_count.values, color='coral', alpha=0.7, edgecolor='black', linewidth=0.5)
ax2.set_title("Review Volume Over Time", fontsize=13, fontweight='bold')
ax2.set_xlabel("Date", fontsize=11)
ax2.set_ylabel("Number of Reviews", fontsize=11)
ax2.grid(axis='y', alpha=0.3, linestyle='--')
# Add trend line
z = np.polyfit(range(len(time_index)), df_time_count.values, 1)
p = np.poly1d(z)
ax2.plot(time_index, p(range(len(time_index))), "r--", linewidth=2,
label=f'Trend: {"increasing" if z[0] > 0 else "decreasing"}', alpha=0.8)
ax2.legend(loc='best', fontsize=9)
# Panel 3: Rating Distribution Over Time (Stacked Area)
ax3 = fig.add_subplot(gs[1, 1])
df_time_dist = df.groupby([df["timestamp"].dt.to_period("M"), "rating"]).size().unstack(fill_value=0)
df_time_dist_pct = df_time_dist.div(df_time_dist.sum(axis=1), axis=0) * 100
# Convert to timestamp for plotting
dist_index = df_time_dist_pct.index.to_timestamp()
ax3.stackplot(dist_index,
[df_time_dist_pct[col].values for col in sorted(df_time_dist_pct.columns)],
labels=[f'{int(col)}' for col in sorted(df_time_dist_pct.columns)],
colors=[colors[col] for col in sorted(df_time_dist_pct.columns)],
alpha=0.8)
ax3.set_title("Rating Distribution Over Time (%)", fontsize=13, fontweight='bold')
ax3.set_xlabel("Date", fontsize=11)
ax3.set_ylabel("Percentage (%)", fontsize=11)
ax3.legend(loc='center left', bbox_to_anchor=(1, 0.5), fontsize=9)
ax3.grid(True, alpha=0.3, linestyle='--')
ax3.set_ylim(0, 100)
plt.suptitle("Temporal Analysis of Reviews", fontsize=15, fontweight='bold', y=0.995)
#plt.tight_layout()
plt.show()
# Convert timestamp (in milliseconds) to datetime; skip if an earlier cell already converted it
if not pd.api.types.is_datetime64_any_dtype(df["timestamp"]):
    df["timestamp"] = pd.to_datetime(df["timestamp"], unit="ms", errors="coerce")
# Add temporal features
df["year"] = df["timestamp"].dt.year
df["month"] = df["timestamp"].dt.month
df["quarter"] = df["timestamp"].dt.quarter
df["day_of_week"] = df["timestamp"].dt.dayofweek
df["season"] = df["month"].apply(lambda x:
'Winter' if x in [12, 1, 2] else
'Spring' if x in [3, 4, 5] else
'Summer' if x in [6, 7, 8] else 'Fall')
# Advanced Temporal Analysis
fig = plt.figure(figsize=(18, 12))
gs = fig.add_gridspec(3, 2, hspace=0.35, wspace=0.3)
colors_rating = {1: '#d7191c', 2: '#fdae61', 3: '#ffffbf', 4: '#a6d96a', 5: '#1a9641'}
# Panel 1: Average Rating Over Time with Volume
ax1 = fig.add_subplot(gs[0, :])
monthly = df.groupby(df["timestamp"].dt.to_period("M"))["rating"].agg(["mean", "count"])
df_time_mean, df_time_count = monthly["mean"], monthly["count"]
time_index = df_time_mean.index.to_timestamp()
# Twin axis for volume
ax1_twin = ax1.twinx()
ax1.plot(time_index, df_time_mean.values, linewidth=2.5, color='steelblue',
label='Mean Rating', marker='o', markersize=3)
ax1_twin.bar(time_index, df_time_count.values, color='lightcoral', alpha=0.3,
label='Review Count', width=20)
ax1.axhline(y=df["rating"].mean(), color='red', linestyle='--', linewidth=1.5, alpha=0.7)
ax1.set_title("Average Rating and Review Volume Over Time", fontsize=13, fontweight='bold')
ax1.set_xlabel("Date", fontsize=11)
ax1.set_ylabel("Average Rating", fontsize=11, color='steelblue')
ax1_twin.set_ylabel("Number of Reviews", fontsize=11, color='lightcoral')
ax1.tick_params(axis='y', labelcolor='steelblue')
ax1_twin.tick_params(axis='y', labelcolor='lightcoral')
ax1.grid(True, alpha=0.3, linestyle='--')
ax1.legend(loc='upper left', fontsize=10)
ax1_twin.legend(loc='upper right', fontsize=10)
# Panel 2: Ratings by Season
ax2 = fig.add_subplot(gs[1, 0])
season_order = ['Winter', 'Spring', 'Summer', 'Fall']
season_data = df.groupby('season')['rating'].mean().reindex(season_order)
bars = ax2.bar(season_order, season_data.values,
color=['#3498db', '#2ecc71', '#f39c12', '#e74c3c'],
alpha=0.7, edgecolor='black', linewidth=1.5)
ax2.set_title("Average Rating by Season", fontsize=13, fontweight='bold')
ax2.set_xlabel("Season", fontsize=11)
ax2.set_ylabel("Average Rating", fontsize=11)
ax2.grid(axis='y', alpha=0.3, linestyle='--')
ax2.set_ylim(df["rating"].mean() - 0.5, df["rating"].mean() + 0.5)
for i, v in enumerate(season_data.values):
ax2.text(i, v + 0.02, f'{v:.3f}', ha='center', fontweight='bold', fontsize=10)
# Panel 3: Ratings by Day of Week
ax3 = fig.add_subplot(gs[1, 1])
dow_labels = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
dow_data = df.groupby('day_of_week')['rating'].mean()
ax3.plot(range(7), dow_data.values, marker='o', linewidth=2.5,
markersize=8, color='darkgreen', label='Mean Rating')
ax3.fill_between(range(7), dow_data.values, alpha=0.3, color='lightgreen')
ax3.set_title("Average Rating by Day of Week", fontsize=13, fontweight='bold')
ax3.set_xlabel("Day of Week", fontsize=11)
ax3.set_ylabel("Average Rating", fontsize=11)
ax3.set_xticks(range(7))
ax3.set_xticklabels(dow_labels)
ax3.grid(True, alpha=0.3, linestyle='--')
ax3.set_ylim(df["rating"].mean() - 0.5, df["rating"].mean() + 0.5)
# Panel 4: Rating Distribution Heatmap by Month
ax4 = fig.add_subplot(gs[2, 0])
month_rating_dist = df.groupby(['month', 'rating']).size().unstack(fill_value=0)
month_rating_pct = month_rating_dist.div(month_rating_dist.sum(axis=1), axis=0) * 100
sns.heatmap(month_rating_pct.T, annot=True, fmt='.1f', cmap='RdYlGn',
cbar_kws={'label': 'Percentage (%)'}, ax=ax4, linewidths=0.5)
ax4.set_title("Rating Distribution by Month (%)", fontsize=13, fontweight='bold')
ax4.set_xlabel("Month", fontsize=11)
ax4.set_ylabel("Rating", fontsize=11)
month_labels = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
ax4.set_xticklabels([month_labels[int(x.get_text())-1] for x in ax4.get_xticklabels()], rotation=45)
# Panel 5: Yearly Trends
ax5 = fig.add_subplot(gs[2, 1])
yearly_rating = df.groupby('year')['rating'].mean()
yearly_count = df.groupby('year')['rating'].count()
ax5_twin = ax5.twinx()
ax5.plot(yearly_rating.index, yearly_rating.values, marker='o', linewidth=3,
markersize=10, color='darkblue', label='Mean Rating')
ax5_twin.bar(yearly_count.index, yearly_count.values, alpha=0.3,
color='orange', label='Review Count')
ax5.set_title("Yearly Rating Trends", fontsize=13, fontweight='bold')
ax5.set_xlabel("Year", fontsize=11)
ax5.set_ylabel("Average Rating", fontsize=11, color='darkblue')
ax5_twin.set_ylabel("Number of Reviews", fontsize=11, color='orange')
ax5.tick_params(axis='y', labelcolor='darkblue')
ax5_twin.tick_params(axis='y', labelcolor='orange')
ax5.grid(True, alpha=0.3, linestyle='--')
ax5.legend(loc='upper left', fontsize=10)
ax5_twin.legend(loc='upper right', fontsize=10)
plt.suptitle("Comprehensive Temporal Analysis of Reviews", fontsize=15, fontweight='bold', y=0.998)
#plt.tight_layout()
plt.show()
# Detailed statistics
print("\n" + "="*70)
print("COMPREHENSIVE TEMPORAL STATISTICS")
print("="*70)
print(f"\n OVERALL DATE RANGE")
print(f" First review: {df['timestamp'].min()}")
print(f" Last review: {df['timestamp'].max()}")
print(f" Time span: {(df['timestamp'].max() - df['timestamp'].min()).days} days\n")
print(f" SEASONAL PATTERNS")
for season in season_order:
season_df = df[df['season'] == season]
print(f" {season:6s}: Avg Rating = {season_df['rating'].mean():.3f} ⭐ | Count = {len(season_df):,}")
print(f"\n YEARLY BREAKDOWN")
for year in sorted(df['year'].unique()):
year_df = df[df['year'] == year]
print(f" {year}: Avg = {year_df['rating'].mean():.3f} ⭐ | Reviews = {len(year_df):>7,}")
print(f"\n INSIGHTS")
print(f" Highest-rated day of week: {dow_labels[dow_data.idxmax()]} ({dow_data.max():.3f} ⭐)")
print(f" Best rated season: {season_data.idxmax()} ({season_data.max():.3f} ⭐)")
print(f" Highest volume year: {yearly_count.idxmax()} ({yearly_count.max():,} reviews)")
print("="*70)
======================================================================
COMPREHENSIVE TEMPORAL STATISTICS
======================================================================

 OVERALL DATE RANGE
 First review: 1998-03-03 12:17:02
 Last review:  2023-03-19 16:26:57.119000
 Time span:    9147 days

 SEASONAL PATTERNS
 Winter: Avg Rating = 4.264 ⭐ | Count = 74,847
 Spring: Avg Rating = 4.226 ⭐ | Count = 56,929
 Summer: Avg Rating = 4.237 ⭐ | Count = 60,389
 Fall  : Avg Rating = 4.225 ⭐ | Count = 57,835

 YEARLY BREAKDOWN
 1998: Avg = 5.000 ⭐ | Reviews =       1
 1999: Avg = 2.000 ⭐ | Reviews =       1
 2000: Avg = 4.364 ⭐ | Reviews =      11
 2001: Avg = 4.176 ⭐ | Reviews =      17
 2002: Avg = 3.868 ⭐ | Reviews =      38
 2003: Avg = 3.930 ⭐ | Reviews =      43
 2004: Avg = 3.770 ⭐ | Reviews =      87
 2005: Avg = 3.882 ⭐ | Reviews =     153
 2006: Avg = 3.850 ⭐ | Reviews =     226
 2007: Avg = 3.972 ⭐ | Reviews =     531
 2008: Avg = 4.036 ⭐ | Reviews =     669
 2009: Avg = 4.052 ⭐ | Reviews =     868
 2010: Avg = 3.966 ⭐ | Reviews =   1,469
 2011: Avg = 4.089 ⭐ | Reviews =   2,707
 2012: Avg = 4.142 ⭐ | Reviews =   4,223
 2013: Avg = 4.180 ⭐ | Reviews =   8,542
 2014: Avg = 4.223 ⭐ | Reviews =  13,175
 2015: Avg = 4.287 ⭐ | Reviews =  18,361
 2016: Avg = 4.296 ⭐ | Reviews =  21,186
 2017: Avg = 4.245 ⭐ | Reviews =  20,929
 2018: Avg = 4.249 ⭐ | Reviews =  23,387
 2019: Avg = 4.295 ⭐ | Reviews =  35,247
 2020: Avg = 4.254 ⭐ | Reviews =  33,280
 2021: Avg = 4.205 ⭐ | Reviews =  30,543
 2022: Avg = 4.199 ⭐ | Reviews =  28,001
 2023: Avg = 4.249 ⭐ | Reviews =   6,305

 INSIGHTS
 Highest-rated day of week: Mon (4.259 ⭐)
 Best rated season: Winter (4.264 ⭐)
 Highest volume year: 2019 (35,247 reviews)
======================================================================
Finding: Extreme Power Law Distribution in User and Item Activity¶
User Review Counts:
- Mean: ~5.6 reviews per user (250,000 reviews / 44,400 users)
- 31.6% of users wrote only a single review
- Top 100 users: Account for 8.65% of all reviews
- Top 10 users: 1,034, 820, 764... reviews (power users)
- Distribution: Highly right-skewed power law
User Types:
- One-time reviewers: 31.6% of users (14,044 users)
- Light reviewers (2-5): 43.6% (19,362 users)
- Medium reviewers (6-20): 20.6% (9,144 users)
- Heavy reviewers (20+): 4.2% (1,850 users)
Item Review Counts:
- Median: 1 review per product (most products appear only once)
- Mean: ~1.95 reviews per product (250,000 reviews / 127,931 products)
- Top 100 products: Account for 7.06% of all reviews
- Most-reviewed products: 4,913, 840, 686... reviews (popular items)
- Distribution: Power law, even more extreme than users
Product Types:
- Single-review products: 72.1% of products (92,209 items)
- Uncommon (2-10 reviews): 26.4% (33,769 items)
- Popular (11-100 reviews): 1.5% (1,916 items)
- Very popular (100+): 0.03% (37 items)
Top-Reviewed Products: The most-reviewed items tend to be:
- Universal accessories (cables, chargers, screen protectors)
- Popular consumer electronics (Amazon Echo, Fire TV Stick)
- High-variance products (cheap items with inconsistent quality)
Average Rating by Popularity:
- Rare products (1 review): 4.31 stars
- Uncommon (2-10): 4.24 stars
- Popular (11-100): 4.27 stars
- Very popular (100+): 4.19 stars
Observation: Very popular products rated slightly lower (survivor bias: bad products accumulate negative reviews before being discontinued)
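The popularity-bucket averages above reduce to a value_counts map plus pd.cut; a minimal sketch on toy data (bucket edges follow the finding; the numbers here are illustrative, not the real dataset's):

```python
import numpy as np
import pandas as pd

# Toy stand-in for df[["asin", "rating"]]
reviews = pd.DataFrame({
    "asin":   ["a", "b", "b", "c", "c", "c", "c"],
    "rating": [5.0, 4.0, 2.0, 5.0, 4.0, 4.0, 3.0],
})

# Reviews per product, broadcast back onto every review row
counts = reviews["asin"].map(reviews["asin"].value_counts())
buckets = pd.cut(counts, bins=[0, 1, 10, 100, np.inf],
                 labels=["1 review", "2-10", "11-100", "100+"])

# Mean rating per popularity bucket (observed=True drops empty buckets)
print(reviews.groupby(buckets, observed=True)["rating"].mean().round(2))
```

On the full frame, swapping `reviews` for `df` gives the per-bucket averages quoted above.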
Implications for Modeling:
- Cold-start problem is severe: 60%+ of test users/items may be new or rare
- Collaborative filtering will struggle on sparse data
- Content-based models (using text) may be more robust
- Need to evaluate stratified performance by user/item activity
- Matrix factorization requires careful regularization to avoid overfitting popular items
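The stratified-evaluation implication can be sketched as a per-bucket RMSE table; a toy example (the `pred` column is a hypothetical model output, not something computed earlier in this notebook):

```python
import numpy as np
import pandas as pd

# Toy held-out frame: true ratings plus a hypothetical model prediction column
test = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u3", "u3", "u3"],
    "rating":  [5.0, 4.0, 1.0, 5.0, 5.0, 4.0],
    "pred":    [4.5, 4.0, 3.0, 4.8, 4.9, 4.2],
})

# Reviews each user had in the TRAINING split (toy numbers; u2 is unseen = cold start)
train_counts = pd.Series({"u1": 1, "u3": 25})
activity = test["user_id"].map(train_counts).fillna(0)
bucket = pd.cut(activity, bins=[-1, 0, 5, np.inf],
                labels=["cold-start", "light", "heavy"])

# RMSE per activity bucket exposes where the model actually struggles
sq_err = (test["rating"] - test["pred"]) ** 2
rmse = np.sqrt(sq_err.groupby(bucket, observed=True).mean())
print(rmse.round(3))
```

A large gap between the cold-start bucket and the heavy bucket is exactly the failure mode this finding predicts for collaborative filtering.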
# Cell: Top Products and Users - Comprehensive Analysis
fig = plt.figure(figsize=(18, 12))
gs = fig.add_gridspec(3, 2, hspace=0.35, wspace=0.3)
# Get top items and users
top_items = df["asin"].value_counts().head(20)
top_users = df["user_id"].value_counts().head(20)
# Panel 1: Top 20 Most-Reviewed Products (with ratings)
ax1 = fig.add_subplot(gs[0, :])
top_items_data = []
for asin in top_items.index:
count = top_items[asin]
avg_rating = df[df["asin"] == asin]["rating"].mean()
top_items_data.append({"asin": asin, "count": count, "avg_rating": avg_rating})  # ASINs are 10 characters; no truncation needed
top_items_df = pd.DataFrame(top_items_data)
# Create bars colored by average rating
colors_gradient = plt.cm.RdYlGn((top_items_df["avg_rating"] - 1) / 4.0)  # map the 1-5 star range onto the colormap, matching the colorbar norm below
bars = ax1.barh(range(len(top_items_df)), top_items_df["count"], color=colors_gradient, edgecolor='black', linewidth=0.5)
ax1.set_yticks(range(len(top_items_df)))
ax1.set_yticklabels(top_items_df["asin"], fontsize=9)
ax1.invert_yaxis()
ax1.set_xlabel("Number of Reviews", fontsize=11)
ax1.set_title("Top 20 Most-Reviewed Products (colored by avg rating)", fontsize=13, fontweight='bold')
ax1.grid(axis='x', alpha=0.3, linestyle='--')
# Add count and rating labels
for i, (count, rating) in enumerate(zip(top_items_df["count"], top_items_df["avg_rating"])):
ax1.text(count + 50, i, f'{count} ({rating:.2f})', va='center', fontsize=8, fontweight='bold')
# Add colorbar legend
sm = plt.cm.ScalarMappable(cmap=plt.cm.RdYlGn, norm=plt.Normalize(vmin=1, vmax=5))
sm.set_array([])
cbar = plt.colorbar(sm, ax=ax1, orientation='horizontal', pad=0.12, aspect=40)
cbar.set_label('Average Rating', fontsize=10)
# Panel 2: Review Count Distribution (Long Tail)
ax2 = fig.add_subplot(gs[1, 0])
asin_counts = df["asin"].value_counts()
ax2.hist(asin_counts.values, bins=50, color='steelblue', edgecolor='black', alpha=0.7)
ax2.set_xlabel("Number of Reviews per Product", fontsize=11)
ax2.set_ylabel("Number of Products", fontsize=11)
ax2.set_title("Product Review Count Distribution (Long Tail)", fontsize=13, fontweight='bold')
ax2.set_yscale('log')
ax2.grid(axis='y', alpha=0.3, linestyle='--')
ax2.axvline(x=asin_counts.median(), color='red', linestyle='--', linewidth=2,
label=f'Median: {asin_counts.median():.0f}', alpha=0.8)
ax2.axvline(x=asin_counts.mean(), color='orange', linestyle='--', linewidth=2,
label=f'Mean: {asin_counts.mean():.0f}', alpha=0.8)
ax2.legend(fontsize=9)
# Panel 3: Rating Distribution for Top vs Bottom Products
ax3 = fig.add_subplot(gs[1, 1])
top_20_asins = top_items.head(20).index
bottom_products = df[~df["asin"].isin(asin_counts.head(100).index)]  # all reviews outside the 100 most-reviewed products
top_products = df[df["asin"].isin(top_20_asins)]
rating_dist_top = top_products["rating"].value_counts(normalize=True).sort_index() * 100
rating_dist_bottom = bottom_products["rating"].value_counts(normalize=True).sort_index() * 100
x = np.arange(1, 6)
width = 0.35
bars1 = ax3.bar(x - width/2, [rating_dist_top.get(i, 0) for i in range(1, 6)],
width, label='Top 20 Products', color='coral', alpha=0.8, edgecolor='black')
bars2 = ax3.bar(x + width/2, [rating_dist_bottom.get(i, 0) for i in range(1, 6)],
width, label='Outside Top 100 Products', color='steelblue', alpha=0.8, edgecolor='black')
ax3.set_xlabel("Rating", fontsize=11)
ax3.set_ylabel("Percentage (%)", fontsize=11)
ax3.set_title("Rating Distribution: Top 20 vs Remaining Products", fontsize=13, fontweight='bold')
ax3.set_xticks(x)
ax3.set_xticklabels([f'{i}' for i in range(1, 6)])
ax3.legend(fontsize=10)
ax3.grid(axis='y', alpha=0.3, linestyle='--')
# Panel 4: Top 20 Users (Most Active Reviewers)
ax4 = fig.add_subplot(gs[2, 0])
top_users_data = []
for user_id in top_users.head(20).index:
count = top_users[user_id]
avg_rating = df[df["user_id"] == user_id]["rating"].mean()
top_users_data.append({"user": user_id[:20] + "...", "count": count, "avg_rating": avg_rating})
top_users_df = pd.DataFrame(top_users_data)
colors_users = plt.cm.RdYlGn((top_users_df["avg_rating"] - 1) / 4.0)  # map the 1-5 star range onto the colormap
bars_users = ax4.barh(range(len(top_users_df)), top_users_df["count"],
color=colors_users, edgecolor='black', linewidth=0.5)
ax4.set_yticks(range(len(top_users_df)))
ax4.set_yticklabels(top_users_df["user"], fontsize=8)
ax4.invert_yaxis()
ax4.set_xlabel("Number of Reviews", fontsize=11)
ax4.set_title("Top 20 Most Active Reviewers (colored by avg rating)", fontsize=13, fontweight='bold')
ax4.grid(axis='x', alpha=0.3, linestyle='--')
for i, (count, rating) in enumerate(zip(top_users_df["count"], top_users_df["avg_rating"])):
ax4.text(count + 20, i, f'{count} ({rating:.2f})', va='center', fontsize=8, fontweight='bold')
# Panel 5: User Activity Distribution
ax5 = fig.add_subplot(gs[2, 1])
user_counts = df["user_id"].value_counts()
ax5.hist(user_counts.values, bins=50, color='coral', edgecolor='black', alpha=0.7)
ax5.set_xlabel("Number of Reviews per User", fontsize=11)
ax5.set_ylabel("Number of Users", fontsize=11)
ax5.set_title("User Activity Distribution", fontsize=13, fontweight='bold')
ax5.set_yscale('log')
ax5.set_xscale('log')
ax5.grid(True, alpha=0.3, linestyle='--', which='both')
ax5.axvline(x=user_counts.median(), color='red', linestyle='--', linewidth=2,
label=f'Median: {user_counts.median():.0f}', alpha=0.8)
ax5.legend(fontsize=9)
plt.suptitle("Product and User Popularity Analysis", fontsize=15, fontweight='bold', y=0.998)
#plt.tight_layout()
plt.show()
# Comprehensive Statistics
print("\n" + "="*80)
print("PRODUCT POPULARITY STATISTICS")
print("="*80)
print(f"Total unique products (ASINs): {df['asin'].nunique():,}")
print(f"Total reviews: {len(df):,}")
print(f"\nReview distribution:")
print(f" Products with 1 review: {(asin_counts == 1).sum():>8,} ({(asin_counts == 1).sum() / len(asin_counts) * 100:>5.1f}%)")
print(f" Products with 2-10 reviews: {((asin_counts >= 2) & (asin_counts <= 10)).sum():>8,} ({((asin_counts >= 2) & (asin_counts <= 10)).sum() / len(asin_counts) * 100:>5.1f}%)")
print(f" Products with 11-100 reviews: {((asin_counts >= 11) & (asin_counts <= 100)).sum():>8,} ({((asin_counts >= 11) & (asin_counts <= 100)).sum() / len(asin_counts) * 100:>5.1f}%)")
print(f" Products with 100+ reviews: {(asin_counts > 100).sum():>8,} ({(asin_counts > 100).sum() / len(asin_counts) * 100:>5.1f}%)")
print(f"\nTop 5 most-reviewed products:")
for i, (asin, count) in enumerate(top_items.head(5).items(), 1):
avg_rating = df[df["asin"] == asin]["rating"].mean()
print(f" {i}. {asin}")
print(f" Reviews: {count:,} | Avg Rating: {avg_rating:.3f}⭐")
print("\n" + "="*80)
print("USER ACTIVITY STATISTICS")
print("="*80)
print(f"Total unique users: {df['user_id'].nunique():,}")
print(f"\nUser activity distribution:")
print(f" Users with 1 review: {(user_counts == 1).sum():>8,} ({(user_counts == 1).sum() / len(user_counts) * 100:>5.1f}%)")
print(f" Users with 2-10 reviews: {((user_counts >= 2) & (user_counts <= 10)).sum():>8,} ({((user_counts >= 2) & (user_counts <= 10)).sum() / len(user_counts) * 100:>5.1f}%)")
print(f" Users with 11-50 reviews: {((user_counts >= 11) & (user_counts <= 50)).sum():>8,} ({((user_counts >= 11) & (user_counts <= 50)).sum() / len(user_counts) * 100:>5.1f}%)")
print(f" Users with 50+ reviews: {(user_counts > 50).sum():>8,} ({(user_counts > 50).sum() / len(user_counts) * 100:>5.1f}%)")
print(f"\nTop 5 most active reviewers:")
for i, (user, count) in enumerate(top_users.head(5).items(), 1):
avg_rating = df[df["user_id"] == user]["rating"].mean()
print(f" {i}. {user[:40]}...")
print(f" Reviews: {count:,} | Avg Rating: {avg_rating:.3f}⭐")
print("\n" + "="*80)
print("CONCENTRATION METRICS")
print("="*80)
top_10_pct = (top_items.head(10).sum() / len(df)) * 100
top_100_pct = (asin_counts.head(100).sum() / len(df)) * 100
print(f" Top 10 products account for: {top_10_pct:>5.2f}% of all reviews")
print(f" Top 100 products account for: {top_100_pct:>5.2f}% of all reviews")
top_10_users_pct = (user_counts.head(10).sum() / len(df)) * 100
top_100_users_pct = (user_counts.head(100).sum() / len(df)) * 100
print(f" Top 10 users account for: {top_10_users_pct:>5.2f}% of all reviews")
print(f" Top 100 users account for: {top_100_users_pct:>5.2f}% of all reviews")
print("\n" + "="*80)
print("RATING INSIGHTS")
print("="*80)
print(f"Popular products (top 20) avg rating: {top_products['rating'].mean():.3f}⭐")
print(f"Products outside the top 100 avg rating: {bottom_products['rating'].mean():.3f}⭐")
print(f"Overall average rating: {df['rating'].mean():.3f}⭐")
print("="*80)
================================================================================
PRODUCT POPULARITY STATISTICS
================================================================================
Total unique products (ASINs): 127,931
Total reviews: 250,000
Review distribution:
Products with 1 review: 92,209 ( 72.1%)
Products with 2-10 reviews: 33,769 ( 26.4%)
Products with 11-100 reviews: 1,916 ( 1.5%)
Products with 100+ reviews: 37 ( 0.0%)
Top 5 most-reviewed products:
1. B01G8JO5F2
Reviews: 4,913 | Avg Rating: 3.829⭐
2. B00ZV9RDKK
Reviews: 840 | Avg Rating: 4.413⭐
3. B079QHML21
Reviews: 686 | Avg Rating: 4.603⭐
4. B01DFKC2SO
Reviews: 552 | Avg Rating: 4.328⭐
5. B07FZ8S74R
Reviews: 465 | Avg Rating: 4.581⭐
================================================================================
USER ACTIVITY STATISTICS
================================================================================
Total unique users: 44,400
User activity distribution:
Users with 1 review: 14,044 ( 31.6%)
Users with 2-10 reviews: 25,366 ( 57.1%)
Users with 11-50 reviews: 4,584 ( 10.3%)
Users with 50+ reviews: 406 ( 0.9%)
Top 5 most active reviewers:
1. AEIIRIHLIYKQGI7ZOCIJTRDF5NPQ...
Reviews: 1,034 | Avg Rating: 4.491⭐
2. AFTZWAK3ZHAPCNSOT5GCKQDECBTQ...
Reviews: 820 | Avg Rating: 4.887⭐
3. AECTQQX663PTF5UQ2RA5TUL3BXVQ...
Reviews: 764 | Avg Rating: 4.596⭐
4. AG73BVBKUOH22USSFJA5ZWL7AKXA...
Reviews: 592 | Avg Rating: 4.383⭐
5. AGZZXSMMS4WRHHJRBUJZI4FZDHKQ...
Reviews: 574 | Avg Rating: 4.807⭐
================================================================================
CONCENTRATION METRICS
================================================================================
Top 10 products account for: 3.62% of all reviews
Top 100 products account for: 7.06% of all reviews
Top 10 users account for: 2.41% of all reviews
Top 100 users account for: 8.65% of all reviews
================================================================================
RATING INSIGHTS
================================================================================
Popular products (top 20) avg rating: 4.158⭐
Products outside the top 100 avg rating: 4.240⭐
Overall average rating: 4.240⭐
================================================================================
# Cell: User Behavior Analysis
fig = plt.figure(figsize=(18, 10))
gs = fig.add_gridspec(2, 3, hspace=0.3, wspace=0.3)
# Calculate user statistics
user_stats = df.groupby('user_id').agg({
'rating': ['mean', 'std', 'count'],
'verified_purchase': 'mean',
'review_len_words': 'mean'
}).reset_index()
user_stats.columns = ['user_id', 'avg_rating', 'rating_std', 'review_count',
'verify_rate', 'avg_length']
# Panel 1: User rating consistency (std dev)
ax1 = fig.add_subplot(gs[0, 0])
ax1.hist(user_stats[user_stats['review_count'] >= 5]['rating_std'].dropna(),
bins=30, color='purple', alpha=0.7, edgecolor='black')
ax1.set_title('Rating Consistency of Active Users (5+ reviews)', fontweight='bold')
ax1.set_xlabel('Rating Standard Deviation')
ax1.set_ylabel('Number of Users')
ax1.axvline(x=user_stats[user_stats['review_count'] >= 5]['rating_std'].mean(),
color='red', linestyle='--', label=f'Mean: {user_stats[user_stats["review_count"] >= 5]["rating_std"].mean():.2f}')
ax1.legend()
ax1.grid(axis='y', alpha=0.3)
# Panel 2: User average rating distribution
ax2 = fig.add_subplot(gs[0, 1])
ax2.hist(user_stats[user_stats['review_count'] >= 5]['avg_rating'],
bins=20, color='steelblue', alpha=0.7, edgecolor='black')
ax2.set_title('Average Rating Distribution (Active Users)', fontweight='bold')
ax2.set_xlabel('User Average Rating')
ax2.set_ylabel('Number of Users')
ax2.grid(axis='y', alpha=0.3)
# Panel 3: Review count vs average rating
ax3 = fig.add_subplot(gs[0, 2])
scatter_pool = user_stats[user_stats['review_count'] <= 100]
scatter_data = scatter_pool.sample(min(5000, len(scatter_pool)), random_state=42)
scatter = ax3.scatter(scatter_data['review_count'], scatter_data['avg_rating'],
c=scatter_data['verify_rate'], cmap='RdYlGn', alpha=0.5, s=20)
ax3.set_title('User Activity vs Rating Pattern', fontweight='bold')
ax3.set_xlabel('Number of Reviews')
ax3.set_ylabel('Average Rating Given')
ax3.grid(True, alpha=0.3)
plt.colorbar(scatter, ax=ax3, label='Verification Rate')
# Panel 4: User types classification
ax4 = fig.add_subplot(gs[1, 0])
user_types = pd.DataFrame({
'Category': ['1 Review', '2-5 Reviews', '6-20 Reviews', '20+ Reviews'],
'Count': [
(user_stats['review_count'] == 1).sum(),
((user_stats['review_count'] >= 2) & (user_stats['review_count'] <= 5)).sum(),
((user_stats['review_count'] >= 6) & (user_stats['review_count'] <= 20)).sum(),
(user_stats['review_count'] > 20).sum()
]
})
colors_pie = ['#ff9999', '#ffcc99', '#99ccff', '#99ff99']
ax4.pie(user_types['Count'], labels=user_types['Category'], autopct='%1.1f%%',
colors=colors_pie, startangle=90)
ax4.set_title('User Engagement Types', fontweight='bold')
# Panel 5: Harsh vs Lenient raters
ax5 = fig.add_subplot(gs[1, 1])
active_users = user_stats[user_stats['review_count'] >= 10]
harsh_raters = active_users[active_users['avg_rating'] < 3].shape[0]
neutral_raters = active_users[(active_users['avg_rating'] >= 3) & (active_users['avg_rating'] <= 4)].shape[0]
lenient_raters = active_users[active_users['avg_rating'] > 4].shape[0]
rater_types = ['Harsh\n(<3)', 'Neutral\n(3-4)', 'Lenient\n(>4)']
rater_counts = [harsh_raters, neutral_raters, lenient_raters]
bars = ax5.bar(rater_types, rater_counts, color=['#d7191c', '#ffffbf', '#1a9641'],
alpha=0.7, edgecolor='black')
ax5.set_title('Rater Types (Users with 10+ reviews)', fontweight='bold')
ax5.set_ylabel('Number of Users')
ax5.grid(axis='y', alpha=0.3)
for i, v in enumerate(rater_counts):
ax5.text(i, v + 50, str(v), ha='center', fontweight='bold')
# Panel 6: Review length by user activity
ax6 = fig.add_subplot(gs[1, 2])
activity_groups = pd.cut(user_stats['review_count'], bins=[0, 1, 5, 20, np.inf],
labels=['1', '2-5', '6-20', '20+'])  # open-ended top bin: the most active users exceed 1,000 reviews
user_stats['activity_group'] = activity_groups
length_by_activity = df.merge(user_stats[['user_id', 'activity_group']], on='user_id')
sns.boxplot(data=length_by_activity, x='activity_group', y='review_len_words',
hue='activity_group', palette='Set2', showfliers=False, legend=False, ax=ax6)
ax6.set_title('Review Length by User Activity Level', fontweight='bold')
ax6.set_xlabel('User Review Count')
ax6.set_ylabel('Review Length (words)')
ax6.set_ylim(0, 250)
ax6.grid(axis='y', alpha=0.3)
plt.suptitle('User Behavior Patterns Analysis', fontsize=15, fontweight='bold')
#plt.tight_layout()
plt.show()
print("\n" + "="*60)
print("USER BEHAVIOR INSIGHTS")
print("="*60)
print(f"Total unique users: {len(user_stats):,}")
print(f"\nUser types:")
for cat, count in zip(user_types['Category'], user_types['Count']):
pct = count / len(user_stats) * 100
print(f" {cat:15s}: {count:>8,} ({pct:>5.1f}%)")
print(f"\nActive users (10+ reviews):")
print(f" Harsh raters (<3⭐): {harsh_raters:>6,}")
print(f" Neutral raters (3-4⭐): {neutral_raters:>6,}")
print(f" Lenient raters (>4⭐): {lenient_raters:>6,}")
print("="*60)
============================================================
USER BEHAVIOR INSIGHTS
============================================================
Total unique users: 44,400

User types:
 1 Review       :   14,044 ( 31.6%)
 2-5 Reviews    :   19,362 ( 43.6%)
 6-20 Reviews   :    9,144 ( 20.6%)
 20+ Reviews    :    1,850 (  4.2%)

Active users (10+ reviews):
 Harsh raters (<3⭐):     234
 Neutral raters (3-4⭐): 1,406
 Lenient raters (>4⭐):  4,073
============================================================
Finding: Text Content Reveals Clear Sentiment Patterns¶
Word Cloud Analysis:
1-Star Reviews (Negative):
- Top words: "not", "product", "time", "return", "money", "back", "waste"
- Common phrases: "didn't work", "stopped working", "waste of money", "poor quality"
- Emphasis: High use of ALL CAPS (1.2 words/review) and exclamation marks (0.38/review)
3-Star Reviews (Mixed/Neutral):
- Top words: "good", "okay", "works", "but", "price", "case", "sound"
- Common phrases: "works well but", "good for price", "decent but"
- Emphasis: Moderate CAPS (~1.0 words/review) and questions (0.10/review)
5-Star Reviews (Positive):
- Top words: "great", "love", "perfect", "excellent", "highly", "recommend", "best"
- Common phrases: "highly recommend", "works great", "love this", "best purchase"
- Emphasis: Lowest CAPS usage (0.68 words/review), highest exclamation marks (0.43/review)
Punctuation and Emphasis by Rating:
| Rating | Avg ! Count | Avg ? Count | Avg CAPS Words |
|---|---|---|---|
| 1-star | 0.38 | 0.10 | 1.20 |
| 2-star | 0.21 | 0.11 | 1.09 |
| 3-star | 0.17 | 0.10 | 1.01 |
| 4-star | 0.19 | 0.08 | 0.99 |
| 5-star | 0.43 | 0.03 | 0.68 |
Observations:
- Exclamation marks: U-shaped pattern (high in 1-star anger and 5-star excitement)
- Question marks: Slightly higher in negative reviews (confusion/frustration)
- ALL CAPS: More common in negative reviews (shouting/emphasis on problems)
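The three emphasis signals tabulated above are cheap to extract with simple counts and one regex; a minimal sketch (the function name is illustrative):

```python
import re

def emphasis_features(text: str) -> dict:
    """Count exclamation marks, question marks, and ALL-CAPS words (2+ letters)."""
    return {
        "exclaim": text.count("!"),
        "question": text.count("?"),
        "caps_words": len(re.findall(r"\b[A-Z]{2,}\b", text)),
    }

print(emphasis_features("DO NOT buy this! Stopped working after a week. Why??"))
# → {'exclaim': 1, 'question': 2, 'caps_words': 2}
```

Applied row-wise over df["text"], these yield the per-rating averages in the table.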
Implications for Modeling:
- Text-based models should capture these sentiment signals
- TF-IDF will naturally weight discriminative words like "excellent" vs "terrible"
- Simple punctuation features provide weak predictive power
- N-grams (bigrams) capture phrases like "didn't work" better than unigrams alone
def clean_text_simple(text):
"""Clean text without BeautifulSoup dependency"""
text = str(text)
# Remove HTML tags
text = re.sub(r'<[^>]+>', ' ', text)
# Remove HTML entities
text = re.sub(r'&[a-z]+;', ' ', text)
# Remove URLs
text = re.sub(r'http\S+|www\S+', '', text)
# Remove email addresses
text = re.sub(r'\S+@\S+', '', text)
# Remove numbers
text = re.sub(r'\d+', '', text)
# Remove special characters and extra spaces
text = re.sub(r'[^a-zA-Z\s]', ' ', text)
text = re.sub(r'\s+', ' ', text).strip()
return text.lower()
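A quick sanity check of clean_text_simple on a deliberately messy string (the function body is repeated here so the snippet runs standalone):

```python
import re

def clean_text_simple(text):
    """Same cleaning steps as the cell above: strip HTML, URLs, emails, digits, punctuation."""
    text = str(text)
    text = re.sub(r'<[^>]+>', ' ', text)        # HTML tags
    text = re.sub(r'&[a-z]+;', ' ', text)       # HTML entities
    text = re.sub(r'http\S+|www\S+', '', text)  # URLs
    text = re.sub(r'\S+@\S+', '', text)         # email addresses
    text = re.sub(r'\d+', '', text)             # numbers
    text = re.sub(r'[^a-zA-Z\s]', ' ', text)    # remaining special characters
    return re.sub(r'\s+', ' ', text).strip().lower()

print(clean_text_simple('<br>Great GPS &amp; maps! Visit http://ex.co or email me@x.com 10/10'))
# → great gps maps visit or email
```

Tags, entities, the URL, the email address, and the digits are all stripped before lowercasing, which is what keeps the word clouds below free of `br`/`amp` debris.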
fig = plt.figure(figsize=(18, 10))
gs = fig.add_gridspec(2, 3, hspace=0.3, wspace=0.3)
# Enhanced stop words
stop_words = {
'the', 'a', 'an', 'and', 'or', 'but', 'in', 'on', 'at', 'to', 'for',
'of', 'with', 'is', 'was', 'are', 'were', 'it', 'this', 'that', 'i',
'have', 'has', 'had', 'be', 'been', 'being', 'will', 'would', 'could',
'should', 'may', 'might', 'must', 'can', 'do', 'does', 'did', 'my',
'your', 'their', 'our', 'his', 'her', 'its', 'they', 'them', 'we', 'you',
'he', 'she', 'not', 'no', 'yes', 'as', 'by', 'from', 'up', 'about', 'get',
'all', 'so', 'if', 'what', 'when', 'where', 'who', 'how', 'there', 'here',
'into', 'through', 'during', 'before', 'after', 'just', 'work', 'these', 'because',
'also', 'than', 'then', 'some', 'other', 'very',
'br', 'nbsp', 'quot', 'amp', 'lt', 'gt', 've', 'll', 're', 'don', 't', 's', 'one'
}
# Panel 1-3: Word clouds by rating (1-star, 3-star, 5-star)
for idx, rating in enumerate([1, 3, 5]):
ax = fig.add_subplot(gs[0, idx])
# Clean and combine text
text = ' '.join(df[df['rating'] == rating]['text'].astype(str).apply(clean_text_simple))
wordcloud = WordCloud(
width=600,
height=400,
background_color='white',
colormap='Reds' if rating == 1 else ('Oranges' if rating == 3 else 'Greens'),
max_words=100,
stopwords=stop_words,
collocations=False,
min_word_length=3,
relative_scaling=0.5
).generate(text)
ax.imshow(wordcloud, interpolation='bilinear')
ax.axis('off')
ax.set_title(f'{rating}-Star Reviews Word Cloud', fontsize=13, fontweight='bold')
# Panel 4: Most common words in positive vs negative reviews
ax4 = fig.add_subplot(gs[1, 0])
positive_text = ' '.join(df[df['rating'] >= 4]['text'].astype(str).apply(clean_text_simple))
negative_text = ' '.join(df[df['rating'] <= 2]['text'].astype(str).apply(clean_text_simple))
pos_words = [w for w in positive_text.split() if w not in stop_words and len(w) > 3]
neg_words = [w for w in negative_text.split() if w not in stop_words and len(w) > 3]
pos_common = Counter(pos_words).most_common(15)
neg_common = Counter(neg_words).most_common(15)
y_pos = range(15)
ax4.barh(y_pos, [count for word, count in pos_common], color='green', alpha=0.6)
ax4.set_yticks(y_pos)
ax4.set_yticklabels([word for word, count in pos_common], fontsize=9)
ax4.invert_yaxis()
ax4.set_title('Top Words in Positive Reviews (4-5)', fontsize=11, fontweight='bold')
ax4.set_xlabel('Frequency')
ax4.grid(axis='x', alpha=0.3)
# Panel 5: Negative review words
ax5 = fig.add_subplot(gs[1, 1])
ax5.barh(y_pos, [count for word, count in neg_common], color='red', alpha=0.6)
ax5.set_yticks(y_pos)
ax5.set_yticklabels([word for word, count in neg_common], fontsize=9)
ax5.invert_yaxis()
ax5.set_title('Top Words in Negative Reviews (1-2)', fontsize=11, fontweight='bold')
ax5.set_xlabel('Frequency')
ax5.grid(axis='x', alpha=0.3)
# Panel 6: Sentiment indicators
ax6 = fig.add_subplot(gs[1, 2])
df['exclamation_count'] = df['text'].str.count('!')
df['question_count'] = df['text'].str.count(r'\?')
df['all_caps_words'] = df['text'].apply(lambda x: len(re.findall(r'\b[A-Z]{2,}\b', str(x))))
sentiment_features = df.groupby('rating')[['exclamation_count', 'question_count', 'all_caps_words']].mean()
sentiment_features.plot(kind='bar', ax=ax6, color=['coral', 'steelblue', 'gold'])
ax6.set_title('Punctuation & Emphasis by Rating', fontsize=11, fontweight='bold')
ax6.set_xlabel('Rating')
ax6.set_ylabel('Average Count per Review')
ax6.legend(['Exclamations (!)', 'Questions (?)', 'ALL CAPS Words'], fontsize=9)
ax6.grid(axis='y', alpha=0.3)
plt.xticks(rotation=0)
plt.suptitle('Text Content Analysis by Rating', fontsize=15, fontweight='bold', y=0.995)
#plt.tight_layout()
plt.show()
print("\nText Feature Statistics:")
print(df.groupby('rating')[['review_len_words', 'exclamation_count',
'question_count', 'all_caps_words']].mean().round(2))
Text Feature Statistics:
review_len_words exclamation_count question_count all_caps_words
rating
1.0 62.30 0.39 0.11 1.21
2.0 79.71 0.22 0.11 1.10
3.0 82.95 0.18 0.09 1.01
4.0 83.32 0.23 0.06 0.98
5.0 50.67 0.44 0.03 0.69
# Cell: Feature Correlation Analysis
fig, ax = plt.subplots(figsize=(10, 8))
# Create correlation matrix with relevant features
correlation_features = df[[
'rating',
'review_len_words',
'verified_purchase',
'exclamation_count',
'question_count',
'all_caps_words'
]].copy()
# Convert boolean to int
correlation_features['verified_purchase'] = correlation_features['verified_purchase'].astype(int)
# Calculate correlation
corr_matrix = correlation_features.corr()
# Plot heatmap
sns.heatmap(corr_matrix, annot=True, fmt='.3f', cmap='coolwarm', center=0,
square=True, linewidths=1, cbar_kws={"shrink": 0.8}, ax=ax)
ax.set_title('Feature Correlation Matrix', fontsize=14, fontweight='bold', pad=20)
plt.tight_layout()
plt.show()
print("\n" + "="*60)
print("KEY CORRELATIONS WITH RATING")
print("="*60)
rating_corr = corr_matrix['rating'].sort_values(ascending=False)
for feature, corr in rating_corr.items():
if feature != 'rating':
print(f" {feature:25s}: {corr:>7.3f}")
print("="*60)
============================================================
KEY CORRELATIONS WITH RATING
============================================================
  exclamation_count        :   0.046
  verified_purchase        :   0.012
  all_caps_words           :  -0.050
  review_len_words         :  -0.082
  question_count           :  -0.082
============================================================
2.4 Summary and Implications for Modeling¶
Based on our EDA, we can draw the following conclusions for model development:
Key Takeaways:
Strong class imbalance requires RMSE/MAE over accuracy
- 65% 5-star reviews make accuracy misleading
- Must evaluate across all rating bins, not just overall
Cold-start problem will be severe
- 50%+ users have ≤1 review in training
- 45%+ products have ≤2 reviews in training
- Text-based models will be critical for new users/items
Text content is highly informative
- Clear sentiment patterns in word usage
- Review length correlates weakly and negatively with rating (r ≈ -0.08)
- TF-IDF should capture discriminative vocabulary
Temporal ordering is essential
- Rating inflation over time (4.0 → 4.4 stars)
- Must use chronological train/test split
- Test set may be slightly harder due to distribution shift
Verified purchase is a useful signal
- 8% difference in rating distributions
- Should be included as model feature
Collaborative filtering will struggle
- Extreme sparsity (99.93%)
- Power law distributions
- Need strong regularization
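A back-of-envelope check makes the sparsity concrete. Using the unique user/item counts reported in the modeling cells below (38,635 users, 95,121 items, 190,000 training ratings); the 99.93% figure above refers to the full dataset, and the training split alone is even sparser:

```python
# Fraction of the user-item matrix that is observed in the training split.
n_users, n_items, n_ratings = 38_635, 95_121, 190_000
density = n_ratings / (n_users * n_items)
print(f"density = {density:.2e}, sparsity = {1 - density:.4%}")
```

Roughly 5 observed cells per 100,000, which is why user/item latent factors are hard to estimate reliably here.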
Modeling Strategy:
Based on these findings, we will implement:
- Baseline models: Global mean, user/item means, bias model
- Content-based: TF-IDF + Ridge (should handle cold-start well)
- Collaborative: Matrix factorization (for warm-start comparison)
- Hybrid: Ensemble combining strengths of above
Part 3 — Modeling¶
3.1 Context and Goal¶
Our goal is to predict star ratings (1-5) for Amazon Electronics reviews using a combination of collaborative filtering, content-based, and hybrid approaches. We formulate this as a regression problem where we predict continuous ratings, then optionally round to discrete values for classification metrics.
Task Formulation:
- Input: User ID, product ASIN, review text, and temporal information
- Output: Predicted rating (continuous: 1.0-5.0)
- Optimization: Minimize Mean Squared Error (MSE) on held-out test data
- Evaluation: RMSE (primary), MAE, and classification accuracy (secondary)
Data Split Strategy: We use a temporal split to simulate real-world deployment:
- Training: First 190,000 reviews (earliest chronologically)
- Validation: Next 10,000 reviews (for hyperparameter tuning)
- Test: Remaining 50,000 reviews (most recent, unseen data)
This temporal ordering ensures our models are evaluated on future data they haven't seen, preventing data leakage and better simulating production conditions.
Training Data Distribution:
- 5-star: 124,087 (65.3%)
- 4-star: 28,681 (15.1%)
- 3-star: 13,537 (7.1%)
- 2-star: 8,584 (4.5%)
- 1-star: 15,111 (8.0%)
The strong class imbalance (positive skew) motivates our choice of RMSE/MAE over simple accuracy as primary metrics, and influences our modeling approach to avoid simply predicting the mean.
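A quick check derived purely from the training counts above shows why accuracy alone is deceptive: a constant predictor already reproduces the global-mean baseline numbers, and always predicting 5 scores ~65% accuracy while being useless for ranking.

```python
import numpy as np

# Expand the training rating distribution into a full array of ratings.
counts = {1: 15111, 2: 8584, 3: 13537, 4: 28681, 5: 124087}
ratings = np.repeat(list(counts), list(counts.values()))

mean = ratings.mean()                                # global mean, ~4.25
rmse_mean = np.sqrt(((ratings - mean) ** 2).mean())  # train RMSE of the constant-mean predictor
acc_always_5 = (ratings == 5).mean()                 # accuracy of always predicting 5
print(f"mean={mean:.4f}  rmse(const mean)={rmse_mean:.4f}  acc(always 5)={acc_always_5:.4f}")
```

These values match the Global Mean baseline's train RMSE reported in the next cell.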
# Cell: Data Preparation for Modeling
# Remove any rows with missing critical data (there should be none based on earlier analysis, but in case we use a different subset later)
df_model = df[df['rating'].notna() & df['text'].notna() & df['user_id'].notna() & df['asin'].notna()].copy()
print(f"Dataset size after cleaning: {len(df_model):,} reviews")
# Create train/validation/test split (190k train, 10k validation, rest for test)
# Sort by timestamp to ensure temporal ordering
df_model = df_model.sort_values('timestamp').reset_index(drop=True)
train_size = 190000
val_size = 10000
df_train = df_model.iloc[:train_size].copy()
df_val = df_model.iloc[train_size:train_size+val_size].copy()
df_test = df_model.iloc[train_size+val_size:].copy()
print(f"Train set: {len(df_train):,} reviews")
print(f"Validation set: {len(df_val):,} reviews")
print(f"Test set: {len(df_test):,} reviews")
# Extract target variables
y_train = df_train['rating'].values
y_val = df_val['rating'].values
y_test = df_test['rating'].values
print(f"\nTarget distribution in train set:")
print(df_train['rating'].value_counts().sort_index())
Dataset size after cleaning: 250,000 reviews
Train set: 190,000 reviews
Validation set: 10,000 reviews
Test set: 50,000 reviews

Target distribution in train set:
rating
1.0     15111
2.0      8584
3.0     13537
4.0     28681
5.0    124087
Name: count, dtype: int64
3.2 Discussing Model Advantages / Disadvantages, Challenges and Complexity¶
We implement five distinct modeling approaches, each with different strengths and computational complexity:
Baseline Models:¶
1. Global Mean (Trivial Baseline)
- Approach: Predict the training set mean (4.25) for all reviews
- Advantages:
- Extremely simple and fast (O(1) prediction)
- Provides lower bound on performance
- No overfitting risk
- Disadvantages:
- Ignores all user, item, and content signals
- Poor performance (Test RMSE: 1.329)
- Complexity: O(1) training and prediction
# Cell: Baseline Model 1 - Global Mean
print("\n" + "="*80)
print("BASELINE 1: GLOBAL MEAN")
print("="*80)
global_mean = df_train['rating'].mean()
print(f"Global mean rating: {global_mean:.4f}")
# Predictions
pred_train_global = np.full(len(y_train), global_mean)
pred_val_global = np.full(len(y_val), global_mean)
pred_test_global = np.full(len(y_test), global_mean)
# Evaluation
def evaluate_model(y_true, y_pred, dataset_name="Dataset"):
"""Helper function to evaluate predictions"""
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_true, y_pred)
# Round predictions for classification metrics
y_pred_rounded = np.clip(np.round(y_pred), 1, 5)
accuracy = accuracy_score(y_true, y_pred_rounded)
print(f"\n{dataset_name} Results:")
print(f" MSE: {mse:.4f}")
print(f" RMSE: {rmse:.4f}")
print(f" MAE: {mae:.4f}")
print(f" Accuracy: {accuracy:.4f}")
return {'mse': mse, 'rmse': rmse, 'mae': mae, 'accuracy': accuracy}
results = {}
results['Global Mean'] = {
'train': evaluate_model(y_train, pred_train_global, "Train"),
'val': evaluate_model(y_val, pred_val_global, "Validation"),
'test': evaluate_model(y_test, pred_test_global, "Test")
}
================================================================================
BASELINE 1: GLOBAL MEAN
================================================================================
Global mean rating: 4.2529

Train Results:
  MSE:  1.5569
  RMSE: 1.2478
  MAE:  0.9759
  Accuracy: 0.1510

Validation Results:
  MSE:  1.7686
  RMSE: 1.3299
  MAE:  1.0414
  Accuracy: 0.1137

Test Results:
  MSE:  1.7670
  RMSE: 1.3293
  MAE:  1.0374
  Accuracy: 0.1288
2. User/Item Mean Models
- Approach: Predict based on historical user or item average ratings
- Advantages:
- Simple to implement and interpret
- Captures user rating tendencies and item quality
- Disadvantages:
- Worse than the global mean on test (RMSE ≈ 1.38 vs 1.33), likely because means estimated from only a handful of ratings per user/item are noisy under the skewed distribution
- Cold-start problem for new users/items (fallback to global mean)
- No interaction between user and item signals
- Cannot model temporal dynamics
- Complexity: O(n) training, O(1) prediction with hash tables
# Cell: Baseline Model 2 - User Mean / Item Mean
print("\n" + "="*80)
print("BASELINE 2: USER MEAN / ITEM MEAN")
print("="*80)
# Compute user and item means from training data
user_means = df_train.groupby('user_id')['rating'].mean().to_dict()
item_means = df_train.groupby('asin')['rating'].mean().to_dict()
print(f"Number of unique users in train: {len(user_means):,}")
print(f"Number of unique items in train: {len(item_means):,}")
def predict_user_item_mean(df, user_means, item_means, global_mean, strategy='item'):
"""
Predict using user mean, item mean, or fallback to global mean
strategy: 'user', 'item', or 'combined'
"""
predictions = []
for _, row in df.iterrows():
user_id = row['user_id']
asin = row['asin']
if strategy == 'user':
pred = user_means.get(user_id, global_mean)
elif strategy == 'item':
pred = item_means.get(asin, global_mean)
elif strategy == 'combined':
# Average of user and item means if both available
user_pred = user_means.get(user_id, None)
item_pred = item_means.get(asin, None)
if user_pred is not None and item_pred is not None:
pred = (user_pred + item_pred) / 2
elif user_pred is not None:
pred = user_pred
elif item_pred is not None:
pred = item_pred
else:
pred = global_mean
predictions.append(pred)
return np.array(predictions)
# Item mean predictions
print("\n--- Item Mean Strategy ---")
pred_train_item = predict_user_item_mean(df_train, user_means, item_means, global_mean, 'item')
pred_val_item = predict_user_item_mean(df_val, user_means, item_means, global_mean, 'item')
pred_test_item = predict_user_item_mean(df_test, user_means, item_means, global_mean, 'item')
results['Item Mean'] = {
'train': evaluate_model(y_train, pred_train_item, "Train"),
'val': evaluate_model(y_val, pred_val_item, "Validation"),
'test': evaluate_model(y_test, pred_test_item, "Test")
}
# User mean predictions
print("\n--- User Mean Strategy ---")
pred_train_user = predict_user_item_mean(df_train, user_means, item_means, global_mean, 'user')
pred_val_user = predict_user_item_mean(df_val, user_means, item_means, global_mean, 'user')
pred_test_user = predict_user_item_mean(df_test, user_means, item_means, global_mean, 'user')
results['User Mean'] = {
'train': evaluate_model(y_train, pred_train_user, "Train"),
'val': evaluate_model(y_val, pred_val_user, "Validation"),
'test': evaluate_model(y_test, pred_test_user, "Test")
}
# Combined strategy
print("\n--- Combined Strategy (User + Item Mean) ---")
pred_train_combined = predict_user_item_mean(df_train, user_means, item_means, global_mean, 'combined')
pred_val_combined = predict_user_item_mean(df_val, user_means, item_means, global_mean, 'combined')
pred_test_combined = predict_user_item_mean(df_test, user_means, item_means, global_mean, 'combined')
results['User+Item Mean'] = {
'train': evaluate_model(y_train, pred_train_combined, "Train"),
'val': evaluate_model(y_val, pred_val_combined, "Validation"),
'test': evaluate_model(y_test, pred_test_combined, "Test")
}
================================================================================
BASELINE 2: USER MEAN / ITEM MEAN
================================================================================
Number of unique users in train: 38,635
Number of unique items in train: 95,121

--- Item Mean Strategy ---

Train Results:
  MSE:  0.6935
  RMSE: 0.8328
  MAE:  0.4660
  Accuracy: 0.6597

Validation Results:
  MSE:  1.9552
  RMSE: 1.3983
  MAE:  1.0141
  Accuracy: 0.2620

Test Results:
  MSE:  1.8970
  RMSE: 1.3773
  MAE:  1.0261
  Accuracy: 0.2124

--- User Mean Strategy ---

Train Results:
  MSE:  0.9664
  RMSE: 0.9831
  MAE:  0.6523
  Accuracy: 0.5364

Validation Results:
  MSE:  1.8069
  RMSE: 1.3442
  MAE:  0.9086
  Accuracy: 0.4163

Test Results:
  MSE:  1.9006
  RMSE: 1.3786
  MAE:  0.9604
  Accuracy: 0.3772

--- Combined Strategy (User + Item Mean) ---

Train Results:
  MSE:  0.6215
  RMSE: 0.7883
  MAE:  0.5453
  Accuracy: 0.5938

Validation Results:
  MSE:  1.7787
  RMSE: 1.3337
  MAE:  0.9108
  Accuracy: 0.4290

Test Results:
  MSE:  1.8901
  RMSE: 1.3748
  MAE:  0.9610
  Accuracy: 0.3871
3. Bias Model (α + bu + bi)
- Approach: Decompose ratings into global mean + user bias + item bias
- Mathematical Formulation:
- r̂ui = α + bu + bi
- where α = global mean
- bu = user bias (how much user u rates above/below average)
- bi = item bias (how much item i is rated above/below average)
- Advantages:
- Captures systematic user/item effects efficiently
- Regularization (λ=10.0) prevents overfitting on rare users/items
- Fast training with vectorized operations (under 3 seconds here)
- Better baseline (Test RMSE: 1.284)
- Disadvantages:
- Linear model cannot capture user-item interactions
- Still has cold-start issues
- Ignores review text content
- Complexity: O(n × k) training for k iterations, O(1) prediction
- Implementation: We use alternating least squares with L2 regularization, converging in ~10 iterations
# Cell: Baseline Model 3 - Bias Model (α + bu + bi)
print("\n" + "="*80)
print("BASELINE 3: BIAS MODEL (α + bu + bi) ")
print("="*80)
class BiasModel:
"""Efficient bias model using vectorized operations"""
def __init__(self, lambda_reg=10.0):
self.lambda_reg = lambda_reg
self.global_mean = 0
self.user_bias = {}
self.item_bias = {}
def fit(self, df, n_iterations=10):
"""Fit model using vectorized alternating least squares"""
self.global_mean = df['rating'].mean()
# Get unique IDs
unique_users = df['user_id'].unique()
unique_items = df['asin'].unique()
# Initialize biases to zero
self.user_bias = {user: 0.0 for user in unique_users}
self.item_bias = {item: 0.0 for item in unique_items}
print(f"Training with λ={self.lambda_reg}, iterations={n_iterations}")
print(f"Users: {len(unique_users):,}, Items: {len(unique_items):,}")
# Create temporary columns for faster access
df_work = df[['user_id', 'asin', 'rating']].copy()
# Alternating optimization
for iteration in range(n_iterations):
# Update user biases (vectorized)
# For each user: bu = sum(r - α - bi) / (n_u + λ)
df_work['item_bias'] = df_work['asin'].map(self.item_bias)
user_stats = df_work.groupby('user_id').agg({
'rating': ['sum', 'count'],
'item_bias': 'sum'
})
user_stats.columns = ['rating_sum', 'count', 'item_bias_sum']
# Calculate user biases
numerator = user_stats['rating_sum'] - (self.global_mean * user_stats['count']) - user_stats['item_bias_sum']
denominator = user_stats['count'] + self.lambda_reg
new_user_bias = (numerator / denominator).to_dict()
self.user_bias.update(new_user_bias)
# Update item biases (vectorized)
# For each item: bi = sum(r - α - bu) / (n_i + λ)
df_work['user_bias'] = df_work['user_id'].map(self.user_bias)
item_stats = df_work.groupby('asin').agg({
'rating': ['sum', 'count'],
'user_bias': 'sum'
})
item_stats.columns = ['rating_sum', 'count', 'user_bias_sum']
# Calculate item biases
numerator = item_stats['rating_sum'] - (self.global_mean * item_stats['count']) - item_stats['user_bias_sum']
denominator = item_stats['count'] + self.lambda_reg
new_item_bias = (numerator / denominator).to_dict()
self.item_bias.update(new_item_bias)
# Calculate and print RMSE every few iterations
if (iteration + 1) % 2 == 0 or iteration == 0:
preds = self.predict(df)
rmse = np.sqrt(mean_squared_error(df['rating'], preds))
print(f" Iteration {iteration+1:2d}: RMSE = {rmse:.4f}")
def predict(self, df):
"""Make predictions (vectorized)"""
predictions = (
self.global_mean +
df['user_id'].map(self.user_bias).fillna(0.0) +
df['asin'].map(self.item_bias).fillna(0.0)
).values
return predictions
# Train bias model
import time
start_time = time.time()
bias_model = BiasModel(lambda_reg=10.0)
bias_model.fit(df_train, n_iterations=10)
end_time = time.time()
print(f"\nTraining completed in {end_time - start_time:.2f} seconds")
# Predictions
pred_train_bias = bias_model.predict(df_train)
pred_val_bias = bias_model.predict(df_val)
pred_test_bias = bias_model.predict(df_test)
results['Bias Model'] = {
'train': evaluate_model(y_train, pred_train_bias, "Train"),
'val': evaluate_model(y_val, pred_val_bias, "Validation"),
'test': evaluate_model(y_test, pred_test_bias, "Test")
}
================================================================================
BASELINE 3: BIAS MODEL (α + bu + bi)
================================================================================
Training with λ=10.0, iterations=10
Users: 38,635, Items: 95,121
  Iteration  1: RMSE = 1.0401
  Iteration  2: RMSE = 1.0433
  Iteration  4: RMSE = 1.0434
  Iteration  6: RMSE = 1.0434
  Iteration  8: RMSE = 1.0434
  Iteration 10: RMSE = 1.0434

Training completed in 2.67 seconds

Train Results:
  MSE:  1.0886
  RMSE: 1.0434
  MAE:  0.7852
  Accuracy: 0.4007

Validation Results:
  MSE:  1.6079
  RMSE: 1.2680
  MAE:  0.9539
  Accuracy: 0.2756

Test Results:
  MSE:  1.6479
  RMSE: 1.2837
  MAE:  0.9778
  Accuracy: 0.2456
Advanced Models:¶
4. TF-IDF + Ridge Regression (Content-Based)
- Approach: Extract text features using TF-IDF, train Ridge regression
- Feature Engineering:
- TF-IDF with 5,000 max features
- Unigrams and bigrams (1-2 word phrases)
- Min document frequency: 5 (removes ultra-rare words)
- Max document frequency: 70% (removes overly common words)
- Regularization: α=1.0
- Advantages:
- Best performing model (Test RMSE: 0.935)
- Captures sentiment and product-specific language
- No cold-start problem for new users/items (uses text only)
- Can identify positive/negative words
- Generalizes well across rating values
- Disadvantages:
- Computationally expensive (TF-IDF computation + large feature matrix)
- Memory intensive (190k × 5k sparse matrix)
- Doesn't leverage collaborative signals
- Assumes text length and vocabulary are predictive
- Complexity: O(n × f) training where f=5,000 features, O(f) prediction
- Key Insight: High-rated reviews contain words like "great", "excellent", "perfect" while low-rated reviews contain "broken", "defective", "returned"
# Cell: Advanced Model 1 - Text-Based Model (TF-IDF + Ridge Regression)
print("\n" + "="*80)
print("ADVANCED MODEL 1: TEXT-BASED (TF-IDF + RIDGE REGRESSION)")
print("="*80)
# TF-IDF vectorization
print("Computing TF-IDF features...")
tfidf = TfidfVectorizer(
max_features=5000,
min_df=5,
max_df=0.7,
stop_words='english',
ngram_range=(1, 2)
)
X_train_tfidf = tfidf.fit_transform(df_train['text'].astype(str))
X_val_tfidf = tfidf.transform(df_val['text'].astype(str))
X_test_tfidf = tfidf.transform(df_test['text'].astype(str))
print(f"TF-IDF matrix shape: {X_train_tfidf.shape}")
print(f"Number of features: {X_train_tfidf.shape[1]:,}")
# Train Ridge regression model
print("\nTraining Ridge Regression model...")
ridge_model = Ridge(alpha=1.0)
ridge_model.fit(X_train_tfidf, y_train)
# Predictions
pred_train_text = ridge_model.predict(X_train_tfidf)
pred_val_text = ridge_model.predict(X_val_tfidf)
pred_test_text = ridge_model.predict(X_test_tfidf)
results['TF-IDF + Ridge'] = {
'train': evaluate_model(y_train, pred_train_text, "Train"),
'val': evaluate_model(y_val, pred_val_text, "Validation"),
'test': evaluate_model(y_test, pred_test_text, "Test")
}
# Show most important words for positive/negative predictions
print("\nTop words associated with high ratings:")
feature_names = tfidf.get_feature_names_out()
top_positive_idx = np.argsort(ridge_model.coef_)[-15:]
for idx in reversed(top_positive_idx):
print(f" {feature_names[idx]:20s}: {ridge_model.coef_[idx]:>7.4f}")
print("\nTop words associated with low ratings:")
top_negative_idx = np.argsort(ridge_model.coef_)[:15]
for idx in top_negative_idx:
print(f" {feature_names[idx]:20s}: {ridge_model.coef_[idx]:>7.4f}")
================================================================================
ADVANCED MODEL 1: TEXT-BASED (TF-IDF + RIDGE REGRESSION)
================================================================================
Computing TF-IDF features...
TF-IDF matrix shape: (190000, 5000)
Number of features: 5,000

Training Ridge Regression model...

Train Results:
  MSE:  0.7163
  RMSE: 0.8464
  MAE:  0.6125
  Accuracy: 0.5628

Validation Results:
  MSE:  0.8038
  RMSE: 0.8965
  MAE:  0.6465
  Accuracy: 0.5509

Test Results:
  MSE:  0.8733
  RMSE: 0.9345
  MAE:  0.6764
  Accuracy: 0.5286

Top words associated with high ratings:
  sent new            :  3.0477
  sent replacement    :  2.8079
  blown away          :  2.2966
  wasn sure           :  1.8479
  sample              :  1.7434
  does says           :  1.5010
  does supposed       :  1.4973
  exchange honest     :  1.4897
  didn know           :  1.4807
  didn want           :  1.4329
  flawlessly          :  1.4240
  worried             :  1.3948
  ll update           :  1.3885
  fantastic           :  1.3732
  hours day           :  1.3704

Top words associated with low ratings:
  worst               : -4.0290
  returning           : -4.0029
  useless             : -3.7890
  worthless           : -3.7499
  waste               : -3.6534
  disappointing       : -3.5505
  unusable            : -3.3503
  horrible            : -3.3124
  poor                : -3.2972
  poorly              : -3.2953
  worse               : -3.2678
  disappointment      : -3.1658
  terrible            : -3.1290
  impossible          : -2.9602
  junk                : -2.9302
5. Matrix Factorization (Latent Factor Model)
- Approach: Learn latent representations of users and items via SGD
- Mathematical Formulation:
- r̂ui = α + bu + bi + pu · qi
- where pu = k-dimensional latent vector for user u
- qi = k-dimensional latent vector for item i
- k = 20 (number of latent factors)
- Training: Stochastic Gradient Descent over 20 epochs
- Learning rate: η = 0.005
- Regularization: λ = 0.02
- Trains in ~26 seconds for 20 epochs (train RMSE still decreasing at the final epoch)
- Advantages:
- Captures latent user preferences and item characteristics
- Models user-item interactions beyond simple biases
- Performs on par with the bias model (Test RMSE: 1.283 vs 1.284)
- Classic recommender system approach
- Disadvantages:
- Severe cold-start problem (needs user/item history)
- Hyperparameter sensitive (k, η, λ require a lot of tuning)
- Risk of overfitting on popular items
- Computationally expensive for large k
- Complexity: O(n × k × e) for e epochs, O(k) prediction
- Implementation Challenge: SGD requires careful tuning: too high a learning rate causes divergence, while too low a rate requires many epochs to converge
# Cell: Advanced Model 2 - Matrix Factorization (Latent Factor Model)
print("\n" + "="*80)
print("ADVANCED MODEL 2: MATRIX FACTORIZATION (LATENT FACTORS)")
print("="*80)
class MatrixFactorization:
"""Matrix Factorization with biases using SGD"""
def __init__(self, n_factors=20, learning_rate=0.005, lambda_reg=0.02):
self.n_factors = n_factors
self.learning_rate = learning_rate
self.lambda_reg = lambda_reg
self.global_mean = 0
self.user_bias = {}
self.item_bias = {}
self.user_factors = {}
self.item_factors = {}
def fit(self, df, n_epochs=20, verbose=True):
"""Train using Stochastic Gradient Descent"""
import time
start_time = time.time()
self.global_mean = df['rating'].mean()
# Get unique IDs
unique_users = df['user_id'].unique()
unique_items = df['asin'].unique()
# Initialize biases
self.user_bias = {user: 0.0 for user in unique_users}
self.item_bias = {item: 0.0 for item in unique_items}
# Initialize latent factors with small random values
np.random.seed(42)
self.user_factors = {
user: np.random.normal(0, 0.1, self.n_factors)
for user in unique_users
}
self.item_factors = {
item: np.random.normal(0, 0.1, self.n_factors)
for item in unique_items
}
print(f"Training with {self.n_factors} latent factors")
print(f"Learning rate: {self.learning_rate}, Regularization: {self.lambda_reg}")
print(f"Users: {len(unique_users):,}, Items: {len(unique_items):,}")
# Convert to arrays for faster iteration
users = df['user_id'].values
items = df['asin'].values
ratings = df['rating'].values
n_samples = len(df)
# Training loop
for epoch in range(n_epochs):
# Shuffle data
indices = np.random.permutation(n_samples)
epoch_loss = 0
for idx in indices:
user = users[idx]
item = items[idx]
rating = ratings[idx]
# Get current parameters
bu = self.user_bias[user]
bi = self.item_bias[item]
pu = self.user_factors[user]
qi = self.item_factors[item]
# Prediction
pred = self.global_mean + bu + bi + np.dot(pu, qi)
# Error
error = rating - pred
epoch_loss += error ** 2
# Update biases
self.user_bias[user] += self.learning_rate * (error - self.lambda_reg * bu)
self.item_bias[item] += self.learning_rate * (error - self.lambda_reg * bi)
# Update latent factors
self.user_factors[user] += self.learning_rate * (error * qi - self.lambda_reg * pu)
self.item_factors[item] += self.learning_rate * (error * pu - self.lambda_reg * qi)
# Print progress
if verbose and ((epoch + 1) % 5 == 0 or epoch == 0):
rmse = np.sqrt(epoch_loss / n_samples)
elapsed = time.time() - start_time
print(f" Epoch {epoch+1:2d}/{n_epochs}: RMSE = {rmse:.4f} ({elapsed:.1f}s)")
total_time = time.time() - start_time
print(f"\nTraining completed in {total_time:.2f} seconds")
def predict(self, df):
"""Make predictions"""
predictions = []
for _, row in df.iterrows():
user = row['user_id']
item = row['asin']
# Base prediction
pred = self.global_mean
# Add biases
pred += self.user_bias.get(user, 0.0)
pred += self.item_bias.get(item, 0.0)
# Add latent factor interaction
if user in self.user_factors and item in self.item_factors:
pred += np.dot(self.user_factors[user], self.item_factors[item])
predictions.append(pred)
return np.array(predictions)
# Train matrix factorization model
mf_model = MatrixFactorization(n_factors=20, learning_rate=0.005, lambda_reg=0.02)
mf_model.fit(df_train, n_epochs=20, verbose=True)
# Predictions
print("\nGenerating predictions...")
pred_train_mf = mf_model.predict(df_train)
pred_val_mf = mf_model.predict(df_val)
pred_test_mf = mf_model.predict(df_test)
results['Matrix Factorization'] = {
'train': evaluate_model(y_train, pred_train_mf, "Train"),
'val': evaluate_model(y_val, pred_val_mf, "Validation"),
'test': evaluate_model(y_test, pred_test_mf, "Test")
}
================================================================================
ADVANCED MODEL 2: MATRIX FACTORIZATION (LATENT FACTORS)
================================================================================
Training with 20 latent factors
Learning rate: 0.005, Regularization: 0.02
Users: 38,635, Items: 95,121
  Epoch  1/20: RMSE = 1.2384 (1.5s)
  Epoch  5/20: RMSE = 1.1552 (6.7s)
  Epoch 10/20: RMSE = 1.0833 (13.1s)
  Epoch 15/20: RMSE = 1.0186 (19.6s)
  Epoch 20/20: RMSE = 0.9673 (26.0s)

Training completed in 26.05 seconds

Generating predictions...

Train Results:
  MSE:  0.9145
  RMSE: 0.9563
  MAE:  0.7068
  Accuracy: 0.4833

Validation Results:
  MSE:  1.5992
  RMSE: 1.2646
  MAE:  0.9431
  Accuracy: 0.3046

Test Results:
  MSE:  1.6453
  RMSE: 1.2827
  MAE:  0.9714
  Accuracy: 0.2679
6. Ensemble Model (Weighted Combination)
- Approach: Combine predictions from Bias, TF-IDF, and Matrix Factorization
- Weighting Strategy: Weights proportional to 1/RMSE_validation
- TF-IDF + Ridge: 0.414 (strongest weight)
- Matrix Factorization: 0.293
- Bias Model: 0.293
- Advantages:
- Combines strengths of collaborative and content-based approaches
- Robust to individual model failures
- Second-best performance (Test RMSE: 1.057)
- Reduces variance through averaging
- Disadvantages:
- Increased complexity and computational cost
- Requires training multiple models
- Less interpretable than individual models
- Doesn't improve beyond best component (TF-IDF)
- Complexity: Sum of component complexities
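As a sanity check, the ensemble weights follow directly from the validation RMSEs reported in the individual model results above (the RMSE values below are copied from those outputs):

```python
# Reproduce the inverse-RMSE weighting from the reported validation RMSEs.
val_rmse = {
    "Bias Model": 1.2680,
    "TF-IDF + Ridge": 0.8965,
    "Matrix Factorization": 1.2646,
}
inv = {name: 1.0 / rmse for name, rmse in val_rmse.items()}
total = sum(inv.values())
weights = {name: w / total for name, w in inv.items()}
for name, w in weights.items():
    print(f"  {name:25s}: {w:.4f}")
```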
# Cell: Advanced Model 3 - Ensemble Model (Weighted Average)
print("\n" + "="*80)
print("ADVANCED MODEL 3: ENSEMBLE MODEL (WEIGHTED COMBINATION)")
print("="*80)
# We'll combine predictions from multiple models
# Weights will be inversely proportional to validation RMSE
print("Computing ensemble weights based on validation performance...")
# Get validation RMSEs for models we want to ensemble
ensemble_models = {
'Bias Model': pred_val_bias,
'TF-IDF + Ridge': pred_val_text,
'Matrix Factorization': pred_val_mf
}
# Calculate weights (inverse of RMSE)
weights = {}
for model_name, preds in ensemble_models.items():
rmse = np.sqrt(mean_squared_error(y_val, preds))
weights[model_name] = 1.0 / rmse
# Normalize weights
total_weight = sum(weights.values())
normalized_weights = {k: v/total_weight for k, v in weights.items()}
print("\nEnsemble weights:")
for model_name, weight in normalized_weights.items():
print(f" {model_name:25s}: {weight:.4f}")
# Create ensemble predictions
def ensemble_predict(models_dict, weights_dict):
"""Weighted average of multiple model predictions"""
ensemble = np.zeros(len(next(iter(models_dict.values()))))
for model_name, predictions in models_dict.items():
ensemble += predictions * weights_dict[model_name]
return ensemble
# Train set predictions
train_models = {
'Bias Model': pred_train_bias,
'TF-IDF + Ridge': pred_train_text,
'Matrix Factorization': pred_train_mf
}
pred_train_ensemble = ensemble_predict(train_models, normalized_weights)
# Validation set predictions
pred_val_ensemble = ensemble_predict(ensemble_models, normalized_weights)
# Test set predictions
test_models = {
'Bias Model': pred_test_bias,
'TF-IDF + Ridge': pred_test_text,
'Matrix Factorization': pred_test_mf
}
pred_test_ensemble = ensemble_predict(test_models, normalized_weights)
results['Ensemble Model'] = {
'train': evaluate_model(y_train, pred_train_ensemble, "Train"),
'val': evaluate_model(y_val, pred_val_ensemble, "Validation"),
'test': evaluate_model(y_test, pred_test_ensemble, "Test")
}
================================================================================
ADVANCED MODEL 3: ENSEMBLE MODEL (WEIGHTED COMBINATION)
================================================================================
Computing ensemble weights based on validation performance...
Ensemble weights:
  Bias Model               : 0.2926
  TF-IDF + Ridge           : 0.4139
  Matrix Factorization     : 0.2934
Train Results:
  MSE: 0.7372
  RMSE: 0.8586
  MAE: 0.6424
  Accuracy: 0.5313
Validation Results:
  MSE: 1.0603
  RMSE: 1.0297
  MAE: 0.7727
  Accuracy: 0.4356
Test Results:
  MSE: 1.1181
  RMSE: 1.0574
  MAE: 0.8014
  Accuracy: 0.3957
Part 4 — Model Evaluation and Comparison¶
4.1 Context and Justification¶
We evaluate our models using multiple complementary metrics to understand different aspects of prediction quality:
Primary Metrics:
- RMSE (Root Mean Squared Error): Penalizes large errors more heavily, appropriate for rating prediction
- MAE (Mean Absolute Error): Interpretable in rating units (e.g., MAE=0.68 means average error of 0.68 stars)
Secondary Metrics:
- Classification Accuracy: After rounding to nearest integer (1-5), what percentage are exactly correct?
- Confusion Matrix: Shows systematic biases (e.g., do we over-predict 5-stars?)
- R² (Coefficient of Determination): Measures proportion of variance explained
Why These Metrics?
RMSE and MAE are appropriate because ratings are fundamentally continuous, we care about error magnitude, and they enable gradient-based optimization. Classification accuracy is less meaningful due to the 5-star positive skew.
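As a concrete illustration of these metrics, the following standalone sketch computes all four on a small made-up array of ratings (toy data, not the project's pipeline; R² here uses scikit-learn's `r2_score`):

```python
# Toy illustration of the evaluation metrics (hypothetical data, not project results)
import numpy as np
from sklearn.metrics import (mean_squared_error, mean_absolute_error,
                             accuracy_score, r2_score)

y_true = np.array([5, 4, 1, 3, 5, 2])
y_pred = np.array([4.6, 4.2, 1.9, 3.4, 4.8, 2.3])

rmse = np.sqrt(mean_squared_error(y_true, y_pred))   # penalizes large errors
mae = mean_absolute_error(y_true, y_pred)            # interpretable in stars
r2 = r2_score(y_true, y_pred)                        # variance explained
# For classification-style accuracy, round and clip to the valid 1..5 range first
y_round = np.clip(np.round(y_pred), 1, 5).astype(int)
acc = accuracy_score(y_true, y_round)
print(f"RMSE={rmse:.3f}  MAE={mae:.3f}  R2={r2:.3f}  Acc={acc:.3f}")
```

Note that rounding can flip a near-miss (1.9 for a true 1-star) into a full miss, which is one reason accuracy is treated as a secondary metric.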
Test Set Results (Sorted by RMSE):
| Model | RMSE | MAE | Accuracy | Improvement |
|---|---|---|---|---|
| TF-IDF + Ridge | 0.935 | 0.676 | 52.9% | 29.7% |
| Ensemble | 1.057 | 0.801 | 39.6% | 20.5% |
| Matrix Fact. | 1.283 | 0.971 | 26.8% | 3.5% |
| Bias Model | 1.284 | 0.978 | 24.6% | 3.4% |
| Global Mean | 1.329 | 1.037 | 12.9% | 0.0% |
4.2 Discussing Baselines and Effectiveness of Our Models¶
Key Finding 1: TF-IDF Dominates
The text-based model achieves the best performance by a large margin (RMSE: 0.935), with 29.7% improvement over baseline. This demonstrates that review text contains rich sentiment signals that are highly predictive of ratings. The model successfully learns that words like "excellent" and "perfect" predict high ratings, while "broken" and "defective" predict low ratings.
Key Finding 2: Cold-Start Resilience
TF-IDF maintains consistent performance (~0.68 MAE) across all user types, from new users (0-1 reviews) to heavy users (20+ reviews). In contrast, collaborative methods degrade significantly for new users:
- New users: Bias Model MAE = 1.082 vs TF-IDF MAE = 0.681 (0.401 difference)
- Heavy users: Bias Model MAE = 0.834 vs TF-IDF MAE = 0.668 (0.166 difference)
This makes TF-IDF ideal for production systems where most users and items have limited history.
Key Finding 3: Collaborative Methods Struggle
Matrix Factorization and Bias Model barely beat the global mean baseline (RMSE ~1.28 vs 1.33). This is surprising but explained by:
- Extreme sparsity: ~190k ratings spread across 38.6k users × 95.1k items (>99.99% sparse)
- Temporal shift: Test set contains newer products not seen in training
- Cold-start dominance: 60% of test users and 45% of test items are new/rare
Key Finding 4: Ensemble Doesn't Help
The ensemble achieves second-best performance (RMSE: 1.057) but doesn't beat TF-IDF alone. This occurs because, even though TF-IDF receives the largest weight (41.4%), the two weaker models still contribute more than half of each prediction and drag accuracy down. Ensembles work best when component models have similar accuracy but different error patterns; here, TF-IDF is simply superior.
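The error-pattern point can be demonstrated on synthetic data (this sketch uses made-up predictors, not the models above): averaging two equally noisy predictors with independent errors cuts RMSE by roughly 1/√2, whereas an equal-weight blend of a strong model and a much weaker one is worse than the strong model alone.

```python
# Synthetic demonstration of when weighted averaging helps (hypothetical data)
import numpy as np

rng = np.random.default_rng(0)
y = rng.uniform(1, 5, 50_000)              # "true ratings"
pred_a = y + rng.normal(0, 1.0, y.size)    # strong predictor, noise sd = 1
pred_b = y + rng.normal(0, 1.0, y.size)    # equally strong, independent errors
pred_w = y + rng.normal(0, 2.0, y.size)    # weak predictor, noise sd = 2

rmse = lambda p: float(np.sqrt(np.mean((y - p) ** 2)))

# Independent, equal-sized errors: RMSE of the average drops toward 1/sqrt(2)
print(rmse(pred_a), rmse((pred_a + pred_b) / 2))
# Equal-weight blend with a much weaker model: worse than the strong model alone
print(rmse((pred_a + pred_w) / 2))
```

This mirrors the finding above: with one clearly dominant component, down-weighting or dropping the weak components is the better move.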
Stratified Performance:
Performance varies systematically across data segments:
- 5-star reviews are easiest to predict (most common, clear sentiment)
- 1-2 star reviews are harder (less common, more variable language)
- 3-star reviews are most challenging (ambiguous sentiment)
Residual Analysis:
The TF-IDF model shows excellent calibration:
- Mean residual: -0.003 (near zero, unbiased)
- Residual std: 0.934 (matches RMSE)
- Errors follow approximately normal distribution
- Slight regression to the mean (overpredicts low ratings, underpredicts high ratings)
4.3 Code Walkthrough¶
Our evaluation framework creates comprehensive comparisons across all models and metrics. We visualize performance using multiple complementary views:
- Grouped bar charts: Compare RMSE/MAE/Accuracy across train/val/test
- Mean predictions with error bars: Show calibration and uncertainty
- Residual diagnostics: Confirm model assumptions and identify biases
- Stratified analysis: Reveal cold-start and popularity effects
The code uses pandas DataFrames for structured comparison and matplotlib/seaborn for visualizations. See cells below for detailed implementation.
# Cell: Step 4 - Model Comparison and Evaluation
print("\n" + "="*80)
print("STEP 4: MODEL EVALUATION AND COMPARISON")
print("="*80)
# Create comparison dataframe
comparison_data = []
for model_name, metrics in results.items():
for dataset in ['train', 'val', 'test']:
comparison_data.append({
'Model': model_name,
'Dataset': dataset,
'RMSE': metrics[dataset]['rmse'],
'MAE': metrics[dataset]['mae'],
'Accuracy': metrics[dataset]['accuracy']
})
comparison_df = pd.DataFrame(comparison_data)
# Display results table
print("\n" + "="*80)
print("MODEL PERFORMANCE COMPARISON")
print("="*80)
print("\nValidation Set Results (sorted by RMSE):")
val_results = comparison_df[comparison_df['Dataset'] == 'val'].sort_values('RMSE')
print(val_results.to_string(index=False))
print("\n\nTest Set Results (sorted by RMSE):")
test_results = comparison_df[comparison_df['Dataset'] == 'test'].sort_values('RMSE')
print(test_results.to_string(index=False))
================================================================================
STEP 4: MODEL EVALUATION AND COMPARISON
================================================================================
================================================================================
MODEL PERFORMANCE COMPARISON
================================================================================
Validation Set Results (sorted by RMSE):
Model Dataset RMSE MAE Accuracy
TF-IDF + Ridge val 0.896528 0.646464 0.5509
Ensemble Model val 1.029727 0.772735 0.4356
Matrix Factorization val 1.264601 0.943112 0.3046
Bias Model val 1.268012 0.953909 0.2756
Global Mean val 1.329897 1.041435 0.1137
User+Item Mean val 1.333673 0.910795 0.4290
User Mean val 1.344215 0.908649 0.4163
Item Mean val 1.398270 1.014123 0.2620
Test Set Results (sorted by RMSE):
Model Dataset RMSE MAE Accuracy
TF-IDF + Ridge test 0.934517 0.676354 0.52864
Ensemble Model test 1.057423 0.801450 0.39568
Matrix Factorization test 1.282693 0.971405 0.26792
Bias Model test 1.283699 0.977817 0.24564
Global Mean test 1.329293 1.037363 0.12884
User+Item Mean test 1.374811 0.961010 0.38706
Item Mean test 1.377299 1.026095 0.21244
User Mean test 1.378620 0.960425 0.37722
# Cell: Visualization 1 - Model Performance Comparison
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
metrics_to_plot = ['RMSE', 'MAE', 'Accuracy']
colors = {'train': '#3498db', 'val': '#e74c3c', 'test': '#2ecc71'}
for idx, metric in enumerate(metrics_to_plot):
ax = axes[idx]
# Prepare data for plotting
plot_data = comparison_df.pivot(index='Model', columns='Dataset', values=metric)
plot_data = plot_data[['train', 'val', 'test']] # Ensure order
# Create grouped bar chart
x = np.arange(len(plot_data.index))
width = 0.25
for i, dataset in enumerate(['train', 'val', 'test']):
offset = width * (i - 1)
ax.bar(x + offset, plot_data[dataset], width,
label=dataset.capitalize(), color=colors[dataset], alpha=0.8, edgecolor='black')
ax.set_xlabel('Model', fontweight='bold', fontsize=11)
ax.set_ylabel(metric, fontweight='bold', fontsize=11)
ax.set_title(f'{metric} Comparison', fontweight='bold', fontsize=13)
ax.set_xticks(x)
ax.set_xticklabels(plot_data.index, rotation=45, ha='right', fontsize=9)
ax.legend(fontsize=9)
ax.grid(axis='y', alpha=0.3, linestyle='--')
plt.suptitle('Model Performance Comparison Across Metrics', fontsize=15, fontweight='bold')
plt.tight_layout()
plt.show()
# Cell: Visualization 2 - Prediction vs Actual Combined Heatmap + Mean/Std
fig = plt.figure(figsize=(18, 12))
gs = fig.add_gridspec(3, 3, hspace=0.4, wspace=0.3)
models_to_viz = [
('Global Mean', pred_test_global),
('Item Mean', pred_test_item),
('Bias Model', pred_test_bias),
('TF-IDF + Ridge', pred_test_text),
('Matrix Factorization', pred_test_mf),
('Ensemble Model', pred_test_ensemble)
]
for idx, (model_name, predictions) in enumerate(models_to_viz):
ax = fig.add_subplot(gs[idx // 3, idx % 3])
# Create prediction data by actual rating
plot_data = pd.DataFrame({
'actual': y_test.astype(int),
'predicted': predictions
})
# Calculate mean and std for each actual rating
stats = plot_data.groupby('actual')['predicted'].agg(['mean', 'std', 'count'])
# Create bars for mean predictions
colors = ['#d7191c', '#fdae61', '#ffffbf', '#a6d96a', '#1a9641']
bars = ax.bar(stats.index, stats['mean'], color=colors, alpha=0.7,
edgecolor='black', linewidth=2, width=0.6)
# Add error bars for std
ax.errorbar(stats.index, stats['mean'], yerr=stats['std'],
fmt='none', ecolor='black', capsize=5, capthick=2, alpha=0.8)
# Add perfect prediction line
ax.plot([0.5, 5.5], [0.5, 5.5], 'r--', linewidth=3, label='Perfect Prediction', alpha=0.8)
# Add count annotations
for i, (rating, row) in enumerate(stats.iterrows()):
ax.text(rating, row['mean'] + row['std'] + 0.15,
f"n={int(row['count']):,}",
ha='center', fontsize=8, fontweight='bold')
# Calculate metrics
rmse = np.sqrt(mean_squared_error(y_test, predictions))
mae = mean_absolute_error(y_test, predictions)
ax.set_xlabel('Actual Rating', fontweight='bold', fontsize=11)
ax.set_ylabel('Mean Predicted Rating (±1 SD)', fontweight='bold', fontsize=11)
ax.set_title(f'{model_name}\nRMSE = {rmse:.4f}, MAE = {mae:.4f}',
fontweight='bold', fontsize=12)
ax.set_xlim(0.5, 5.5)
ax.set_ylim(0, 6)
ax.set_xticks(range(1, 6))
ax.set_xticklabels([f'{i}' for i in range(1, 6)])
ax.legend(fontsize=9, loc='upper left')
ax.grid(True, alpha=0.3, axis='y', linestyle='--')
plt.suptitle('Mean Predicted Rating by Actual Rating (with Std Dev)',
fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()
# Cell: Visualization 2.5 - COMPREHENSIVE: Three Views Side by Side
# Pick the best performing models
top_3_models = test_results.head(3)['Model'].tolist()
fig = plt.figure(figsize=(18, len(top_3_models) * 4))
gs = fig.add_gridspec(len(top_3_models), 3, hspace=0.3, wspace=0.3)
model_preds = {
'Global Mean': pred_test_global,
'User Mean': pred_test_user,
'Item Mean': pred_test_item,
'User+Item Mean': pred_test_combined,
'Bias Model': pred_test_bias,
'TF-IDF + Ridge': pred_test_text,
'Matrix Factorization': pred_test_mf,
'Ensemble Model': pred_test_ensemble
}
colors_rating = {1: '#d7191c', 2: '#fdae61', 3: '#ffffbf', 4: '#a6d96a', 5: '#1a9641'}
for row_idx, model_name in enumerate(top_3_models):
predictions = model_preds[model_name]
# Panel 1: Mean + Std bars
ax1 = fig.add_subplot(gs[row_idx, 0])
plot_data = pd.DataFrame({'actual': y_test.astype(int), 'predicted': predictions})
stats = plot_data.groupby('actual')['predicted'].agg(['mean', 'std'])
colors = [colors_rating[i] for i in range(1, 6)]
ax1.bar(stats.index, stats['mean'], color=colors, alpha=0.7, edgecolor='black', linewidth=1.5)
ax1.errorbar(stats.index, stats['mean'], yerr=stats['std'],
fmt='none', ecolor='black', capsize=5, capthick=2)
ax1.plot([0.5, 5.5], [0.5, 5.5], 'r--', linewidth=2)
ax1.set_xlabel('Actual Rating', fontweight='bold')
ax1.set_ylabel('Mean Predicted ±SD', fontweight='bold')
ax1.set_title(f'{model_name}: Mean Predictions', fontweight='bold')
ax1.set_xlim(0.5, 5.5)
ax1.grid(axis='y', alpha=0.3)
# Panel 2: Heatmap
ax2 = fig.add_subplot(gs[row_idx, 1])
pred_rounded = np.clip(np.round(predictions), 1, 5).astype(int)
hist_matrix = np.zeros((5, 5))
for actual, pred in zip(y_test.astype(int), pred_rounded):
hist_matrix[actual-1, pred-1] += 1
hist_matrix_pct = hist_matrix / hist_matrix.sum(axis=1, keepdims=True) * 100
sns.heatmap(hist_matrix_pct, annot=True, fmt='.0f', cmap='RdYlGn',
center=20, vmin=0, vmax=100, ax=ax2, cbar=False,
xticklabels=range(1, 6), yticklabels=range(1, 6))
ax2.set_xlabel('Predicted Rating', fontweight='bold')
ax2.set_ylabel('Actual Rating', fontweight='bold')
ax2.set_title(f'{model_name}: Distribution %', fontweight='bold')
# Panel 3: Violin plot
ax3 = fig.add_subplot(gs[row_idx, 2])
parts = ax3.violinplot(
[plot_data[plot_data['actual'] == i]['predicted'].values for i in range(1, 6)],
positions=range(1, 6), widths=0.6, showmeans=True, showmedians=True
)
for i, pc in enumerate(parts['bodies']):
pc.set_facecolor(colors_rating[i+1])
pc.set_alpha(0.6)
ax3.plot([0.5, 5.5], [0.5, 5.5], 'r--', linewidth=2)
ax3.set_xlabel('Actual Rating', fontweight='bold')
ax3.set_ylabel('Predicted Distribution', fontweight='bold')
ax3.set_title(f'{model_name}: Spread', fontweight='bold')
ax3.set_xlim(0.5, 5.5)
ax3.set_ylim(0.5, 5.5)
ax3.grid(axis='y', alpha=0.3)
plt.suptitle('Top Models: Three Perspectives on Prediction Quality',
fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()
# Cell: Visualization 3 - Residual Analysis for Best Model
# Determine best model from validation set
best_model_idx = val_results.iloc[0]
best_model_name = best_model_idx['Model']
# Get predictions for best model
model_predictions = {
'Global Mean': pred_test_global,
'User Mean': pred_test_user,
'Item Mean': pred_test_item,
'User+Item Mean': pred_test_combined,
'Bias Model': pred_test_bias,
'TF-IDF + Ridge': pred_test_text,
'Matrix Factorization': pred_test_mf,
'Ensemble Model': pred_test_ensemble
}
best_predictions = model_predictions[best_model_name]
print(f"Analyzing residuals for best model: {best_model_name}")
fig, axes = plt.subplots(2, 2, figsize=(16, 10))
residuals = y_test - best_predictions
pred_rounded = np.clip(np.round(best_predictions), 1, 5).astype(int)
# Panel 1: Residual distribution
ax1 = axes[0, 0]
ax1.hist(residuals, bins=50, color='steelblue', alpha=0.7, edgecolor='black')
ax1.axvline(x=0, color='red', linestyle='--', linewidth=2)
ax1.set_xlabel('Residual (Actual - Predicted)', fontweight='bold')
ax1.set_ylabel('Frequency', fontweight='bold')
ax1.set_title(f'Residual Distribution ({best_model_name})\nMean: {residuals.mean():.4f}, Std: {residuals.std():.4f}',
fontweight='bold')
ax1.grid(axis='y', alpha=0.3)
# Panel 2: Residuals vs Predicted
ax2 = axes[0, 1]
ax2.scatter(best_predictions, residuals, alpha=0.3, s=10, c='coral')
ax2.axhline(y=0, color='red', linestyle='--', linewidth=2)
ax2.set_xlabel('Predicted Rating', fontweight='bold')
ax2.set_ylabel('Residual', fontweight='bold')
ax2.set_title('Residuals vs Predicted Values', fontweight='bold')
ax2.grid(True, alpha=0.3)
# Panel 3: Confusion Matrix
ax3 = axes[1, 0]
cm = confusion_matrix(y_test, pred_rounded)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=ax3, cbar_kws={'label': 'Count'})
ax3.set_xlabel('Predicted Rating', fontweight='bold')
ax3.set_ylabel('Actual Rating', fontweight='bold')
ax3.set_title(f'Confusion Matrix ({best_model_name})', fontweight='bold')
# Panel 4: Error by actual rating
ax4 = axes[1, 1]
error_by_rating = pd.DataFrame({
'actual': y_test,
'abs_error': np.abs(residuals)
})
error_by_rating.boxplot(column='abs_error', by='actual', ax=ax4, patch_artist=True, showfliers=False)
ax4.set_xlabel('Actual Rating', fontweight='bold')
ax4.set_ylabel('Absolute Error', fontweight='bold')
ax4.set_title('Prediction Error by Actual Rating', fontweight='bold')
ax4.get_figure().suptitle('') # Remove default title
plt.sca(ax4)
plt.xticks(rotation=0)
ax4.grid(axis='y', alpha=0.3)
plt.suptitle(f'Residual Analysis - {best_model_name} (Test Set)', fontsize=15, fontweight='bold', y=1.00)
plt.tight_layout()
plt.show()
Analyzing residuals for best model: TF-IDF + Ridge
# Cell: Visualization 4 - Model Comparison on Different Data Segments
fig, axes = plt.subplots(2, 2, figsize=(16, 10))
# Get user and item review counts from training data
user_review_counts = df_train.groupby('user_id').size()
item_review_counts = df_train.groupby('asin').size()
# Map to test set
df_test_analysis = df_test.copy()
df_test_analysis['user_review_count'] = df_test_analysis['user_id'].map(user_review_counts).fillna(0)
df_test_analysis['item_review_count'] = df_test_analysis['asin'].map(item_review_counts).fillna(0)
df_test_analysis['abs_error'] = np.abs(y_test - best_predictions)
# Compare top models on different segments
top_models = {
'Bias Model': pred_test_bias,
'TF-IDF + Ridge': pred_test_text,
'Matrix Factorization': pred_test_mf,
'Ensemble': pred_test_ensemble
}
# Panel 1: Performance by rating value
ax1 = axes[0, 0]
rating_performance = []
for rating_val in range(1, 6):
mask = y_test == rating_val
for model_name, preds in top_models.items():
mae = mean_absolute_error(y_test[mask], preds[mask])
rating_performance.append({
'Rating': rating_val,
'Model': model_name,
'MAE': mae
})
rating_perf_df = pd.DataFrame(rating_performance)
for model_name in top_models.keys():
model_data = rating_perf_df[rating_perf_df['Model'] == model_name]
ax1.plot(model_data['Rating'], model_data['MAE'], marker='o', label=model_name, linewidth=2)
ax1.set_xlabel('Actual Rating', fontweight='bold')
ax1.set_ylabel('MAE', fontweight='bold')
ax1.set_title('Model Performance by Rating Value', fontweight='bold')
ax1.legend(fontsize=9)
ax1.grid(True, alpha=0.3)
ax1.set_xticks(range(1, 6))
# Panel 2: Performance on cold-start users
ax2 = axes[0, 1]
cold_start_perf = []
for model_name, preds in top_models.items():
# New users (0-1 reviews in train)
new_users_mask = df_test_analysis['user_review_count'] <= 1
mae_new = mean_absolute_error(y_test[new_users_mask], preds[new_users_mask])
# Existing users (2+ reviews in train)
existing_users_mask = df_test_analysis['user_review_count'] > 1
mae_existing = mean_absolute_error(y_test[existing_users_mask], preds[existing_users_mask])
cold_start_perf.append({
'Model': model_name,
'New Users': mae_new,
'Existing Users': mae_existing
})
cold_df = pd.DataFrame(cold_start_perf)
x = np.arange(len(cold_df))
width = 0.35
ax2.bar(x - width/2, cold_df['New Users'], width, label='New Users', alpha=0.8, color='coral', edgecolor='black')
ax2.bar(x + width/2, cold_df['Existing Users'], width, label='Existing Users', alpha=0.8, color='steelblue', edgecolor='black')
ax2.set_ylabel('MAE', fontweight='bold')
ax2.set_title('Cold Start: New vs Existing Users', fontweight='bold')
ax2.set_xticks(x)
ax2.set_xticklabels(cold_df['Model'], rotation=45, ha='right', fontsize=9)
ax2.legend()
ax2.grid(axis='y', alpha=0.3)
# Panel 3: Performance on cold-start items
ax3 = axes[1, 0]
cold_item_perf = []
for model_name, preds in top_models.items():
# New items (0-1 reviews in train)
new_items_mask = df_test_analysis['item_review_count'] <= 1
mae_new = mean_absolute_error(y_test[new_items_mask], preds[new_items_mask])
# Existing items (2+ reviews in train)
existing_items_mask = df_test_analysis['item_review_count'] > 1
mae_existing = mean_absolute_error(y_test[existing_items_mask], preds[existing_items_mask])
cold_item_perf.append({
'Model': model_name,
'New Items': mae_new,
'Existing Items': mae_existing
})
cold_item_df = pd.DataFrame(cold_item_perf)
x = np.arange(len(cold_item_df))
ax3.bar(x - width/2, cold_item_df['New Items'], width, label='New Items', alpha=0.8, color='#ff9999', edgecolor='black')
ax3.bar(x + width/2, cold_item_df['Existing Items'], width, label='Existing Items', alpha=0.8, color='#66b3ff', edgecolor='black')
ax3.set_ylabel('MAE', fontweight='bold')
ax3.set_title('Cold Start: New vs Existing Items', fontweight='bold')
ax3.set_xticks(x)
ax3.set_xticklabels(cold_item_df['Model'], rotation=45, ha='right', fontsize=9)
ax3.legend()
ax3.grid(axis='y', alpha=0.3)
# Panel 4: Overall comparison summary
ax4 = axes[1, 1]
summary_metrics = []
for model_name, preds in top_models.items():
summary_metrics.append({
'Model': model_name,
'RMSE': np.sqrt(mean_squared_error(y_test, preds)),
'MAE': mean_absolute_error(y_test, preds),
'Accuracy': accuracy_score(y_test, np.clip(np.round(preds), 1, 5))
})
summary_df = pd.DataFrame(summary_metrics)
x = np.arange(len(summary_df))
# Normalize metrics to 0-1 scale for comparison
summary_df['RMSE_norm'] = 1 - (summary_df['RMSE'] - summary_df['RMSE'].min()) / (summary_df['RMSE'].max() - summary_df['RMSE'].min())
summary_df['MAE_norm'] = 1 - (summary_df['MAE'] - summary_df['MAE'].min()) / (summary_df['MAE'].max() - summary_df['MAE'].min())
width = 0.25
ax4.bar(x - width, summary_df['RMSE_norm'], width, label='RMSE', alpha=0.8, color='#e74c3c')
ax4.bar(x, summary_df['MAE_norm'], width, label='MAE', alpha=0.8, color='#3498db')
ax4.bar(x + width, summary_df['Accuracy'], width, label='Accuracy', alpha=0.8, color='#2ecc71')
ax4.set_ylabel('Normalized Score (higher is better)', fontweight='bold')
ax4.set_title('Overall Model Comparison', fontweight='bold')
ax4.set_xticks(x)
ax4.set_xticklabels(summary_df['Model'], rotation=45, ha='right', fontsize=9)
ax4.legend()
ax4.grid(axis='y', alpha=0.3)
ax4.set_ylim(0, 1.1)
plt.suptitle('Comprehensive Model Comparison on Different Data Segments', fontsize=15, fontweight='bold')
plt.tight_layout()
plt.show()
# Cell: Final Summary
print("\n" + "="*80)
print("FINAL SUMMARY")
print("="*80)
# Best model
best_model_name = test_results.iloc[0]['Model']
best_rmse = test_results.iloc[0]['RMSE']
best_mae = test_results.iloc[0]['MAE']
best_acc = test_results.iloc[0]['Accuracy']
print(f"\n Best Model: {best_model_name}")
print(f" Test RMSE: {best_rmse:.4f}")
print(f" Test MAE: {best_mae:.4f}")
print(f" Test Accuracy: {best_acc:.4f}")
# Improvement over baseline
baseline_rmse = results['Global Mean']['test']['rmse']
improvement = ((baseline_rmse - best_rmse) / baseline_rmse) * 100
print(f"\n Improvement over global mean baseline: {improvement:.2f}%")
# Compare model types
print("\n" + "="*80)
print("MODEL TYPE COMPARISON")
print("="*80)
print("\n1. Baseline Models (Simple Heuristics):")
for model in ['Global Mean', 'User Mean', 'Item Mean', 'User+Item Mean']:
if model in results:
rmse = results[model]['test']['rmse']
print(f" {model:20s}: RMSE = {rmse:.4f}")
print("\n2. Collaborative Filtering:")
for model in ['Bias Model', 'Matrix Factorization']:
if model in results:
rmse = results[model]['test']['rmse']
print(f" {model:20s}: RMSE = {rmse:.4f}")
print("\n3. Content-Based:")
for model in ['TF-IDF + Ridge']:
if model in results:
rmse = results[model]['test']['rmse']
print(f" {model:20s}: RMSE = {rmse:.4f}")
print("\n4. Hybrid/Ensemble:")
for model in ['Ensemble Model']:
if model in results:
rmse = results[model]['test']['rmse']
print(f" {model:20s}: RMSE = {rmse:.4f}")
================================================================================
FINAL SUMMARY
================================================================================
 Best Model: TF-IDF + Ridge
 Test RMSE: 0.9345
 Test MAE: 0.6764
 Test Accuracy: 0.5286
 Improvement over global mean baseline: 29.70%
================================================================================
MODEL TYPE COMPARISON
================================================================================
1. Baseline Models (Simple Heuristics):
  Global Mean         : RMSE = 1.3293
  User Mean           : RMSE = 1.3786
  Item Mean           : RMSE = 1.3773
  User+Item Mean      : RMSE = 1.3748
2. Collaborative Filtering:
  Bias Model          : RMSE = 1.2837
  Matrix Factorization: RMSE = 1.2827
3. Content-Based:
  TF-IDF + Ridge      : RMSE = 0.9345
4. Hybrid/Ensemble:
  Ensemble Model      : RMSE = 1.0574